INDEX
Explanations
punctuation marks, particularly periods and parentheses, that indicate the structure of code
Negative Logits
pleaſure    -0.73
itſelf      -0.67
ſelves      -0.65
ſtate       -0.62
Inſ         -0.61
ſelf        -0.60
ſtand       -0.59
ſte         -0.58
themſelves  -0.57
becauſe     -0.56
Positive Logits
').   1.11
").   1.10
()).  1.03
'').  1.00
"].   0.97
').   0.96
]").  0.96
__).  0.96
").   0.91
"").  0.88
Activations Density 0.130%