INDEX
Explanations
punctuation marks and formatting symbols in the text
Negative Logits
greateſt      -1.13
itſelf        -1.10
purpoſe       -1.02
pleaſure      -1.01
myſelf        -1.00
themſelves    -0.99
ſever         -0.98
fubject       -0.96
Reſ           -0.96
ſind          -0.94
Positive Logits
↵↵             0.96
↵↵↵            0.77
The            0.67
</h3>          0.65
</blockquote>  0.64
↵              0.60
"              0.58
↵↵↵↵           0.57
or             0.55
)              0.55
Activations Density 0.586%