INDEX
Explanations
instances of high activation on punctuation marks, especially periods
Negative Logits
ugin        -0.18
unn         -0.15
@Id         -0.15
maduras     -0.14
sober       -0.14
universal   -0.14
plan        -0.14
Ãłu         -0.13
annel       -0.13
ative       -0.13
Positive Logits
krit        0.16
oka         0.15
forman      0.14
781         0.14
andle       0.14
inu         0.14
vig         0.14
gary        0.13
ppe         0.13
fault       0.13
Activations Density 0.185%