INDEX
Explanations
possessive forms and contractions
Negative Logits
combe   -0.16
aktu    -0.15
ugh     -0.15
anzi    -0.15
inson   -0.14
اÙĦات   -0.14
inha    -0.14
ino     -0.14
Colo    -0.14
isan    -0.14
Positive Logits
safe    0.39
safe    0.31
Safe    0.29
fair    0.29
Safe    0.28
-safe   0.27
hard    0.24
fair    0.23
_safe   0.23
SAFE    0.21
Activations Density 0.091%