Explanations
instances of denial or contradiction in statements
Negative Logits
oun         -0.17
/wiki       -0.15
etsk        -0.15
contrary    -0.14
emens       -0.14
contr       -0.14
aci         -0.14
hrad        -0.14
rá          -0.13
Contr       -0.13
Positive Logits
nevertheless    0.20
nonetheless     0.17
歡              0.15
amon            0.14
stuck           0.14
istra           0.14
Essentially     0.14
факт            0.14
Nevertheless    0.13
AMS             0.13
Activations Density: 0.163%