INDEX
Explanations
comparative phrases, especially those indicating superiority or preference
Negative Logits
Token     Logit
outu      -0.19
rou       -0.15
axy       -0.15
sworth    -0.15
irim      -0.15
sw        -0.14
aly       -0.14
ählen     -0.14
Forge     -0.14
gó        -0.14
Positive Logits
Token     Logit
ige       0.17
ever      0.16
á»ķ       0.16
olet      0.15
dozen     0.15
oler      0.14
urret     0.14
usual     0.14
FD        0.14
ovies     0.14
Activations Density 0.057%
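
The density figure can be read as a simple ratio. The source does not define it, so the sketch below assumes that "Activations Density" means the fraction of sampled token positions on which the feature activates above zero. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and chosen only to reproduce a value of 0.057%.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: density = (active positions) / (all sampled positions).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 57 active positions out of 100,000 sampled tokens.
sample = [1.0] * 57 + [0.0] * (100_000 - 57)
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.057% corresponds to roughly 57 active positions per 100,000 tokens, which marks a sparse feature.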