INDEX
Explanations
phrases indicating negation or denial
Negative Logits
Token      Logit
HW         -0.17
anova      -0.16
938        -0.16
ijo        -0.16
iston      -0.15
jon        -0.15
onomies    -0.15
sg         -0.15
iom        -0.15
uhe        -0.15
Positive Logits
Token      Logit
æĪ¶        0.15
tas        0.15
zar        0.15
odiac      0.15
.Stack     0.15
ìĿ´íĬ¸     0.15
enos       0.14
ffset      0.14
aklı       0.14
vũ         0.14
Activations Density: 0.000%