INDEX
Explanations
contradictions or contrasts in statements
Negative Logits
Token    Logit
chter    -0.15
ichen    -0.15
irts     -0.15
argar    -0.15
τζ       -0.14
amet     -0.14
rán      -0.14
pub      -0.13
ug       -0.13
egal     -0.13
Positive Logits
Token        Logit
actually     0.28
Actually     0.24
actually     0.22
Actually     0.22
其实         0.18
Nope         0.16
ensa         0.16
xFFF         0.16
eigentlich   0.15
ething       0.15
Activations Density 0.163%
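
As a rough illustration of what a density figure like the one above measures, here is a minimal sketch. It assumes that density means the fraction of token positions on which the feature's activation is above zero; the dashboard does not state its exact definition. The function name activation_density, the threshold parameter, and the toy data are all hypothetical and are not taken from this page.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means (active positions) / (all positions).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Toy data, invented for illustration: 10,000 token positions,
# 16 of which carry a non-zero activation.
toy = [0.0] * 10000
for i in range(16):
    toy[i * 600] = 1.5

density = activation_density(toy)
print(f"{density:.3%}")
```

A feature that fires on well under 1% of positions, as here, is sparse: it responds to a narrow class of contexts rather than to text in general.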