INDEX
Explanations
phrases indicating negation or the absence of something
Negative Logits
¹           -0.15
ovich       -0.15
alo         -0.14
rik         -0.14
476         -0.14
rick        -0.14
nt          -0.13
alers       -0.13
rape        -0.13
íĦ¸         -0.13
Positive Logits
match       0.28
longer      0.23
xious       0.23
-match      0.22
Buen        0.21
match       0.21
different   0.20
Match       0.20
substitute  0.19
Match       0.19
Activations Density 0.020%