INDEX
Explanations
terms associated with dishonesty and political narratives
Negative Logits
ungi       -0.16
CHandle    -0.15
ewis       -0.14
ungan      -0.14
luk        -0.14
byt        -0.14
eree       -0.14
(Handle    -0.14
ehler      -0.14
िषय        -0.14
Positive Logits
lie        0.68
lies       0.66
lying      0.61
Lie        0.56
lie        0.53
Lies       0.52
Lie        0.49
fib        0.48
lied       0.47
liar       0.47
Activations Density 0.236%