INDEX
Explanations
the presence of special characters or formatting elements in the text
Negative Logits
Diſ          -0.85
themſelves   -0.84
Conſ         -0.83
itſelf       -0.80
Inſ          -0.79
raiſ         -0.79
myſelf       -0.78
himſelf      -0.78
juſ          -0.78
ſta          -0.78
Positive Logits
aDecoder     0.48
d            0.47
MessageOf    0.46
en           0.45
Griswold     0.44
mphony       0.44
czę          0.43
Dragon       0.42
Hentet       0.42
ist          0.42
Activations Density: 0.001%