INDEX
Explanations
phrases indicating disagreement or opposition
Negative Logits
themselves  -0.16
odont       -0.15
IRO         -0.15
iro         -0.14
ilk         -0.13
mess        -0.13
ikan        -0.13
ilib        -0.13
’aut        -0.13
opsis       -0.13
Positive Logits
itself      0.20
its         0.18
urbed       0.17
unes        0.16
Its         0.16
own         0.15
erot        0.15
lef         0.14
Its         0.14
sum         0.14
Activations Density: 0.227%