Explanations
references to falsehoods, lies, and misleading statements in the context of honesty and integrity
Negative Logits
BorderColor    -0.50
onha           -0.49
жкой           -0.48
ngths          -0.46
affinity       -0.46
uygun          -0.46
нибудь         -0.46
ırken          -0.43
اعمال          -0.43
prefer         -0.43
Positive Logits
falsehood      1.07
liar           1.03
perjury        0.94
lied           0.93
liars          0.92
Lies           0.92
Lies           0.88
untrue         0.88
lies           0.88
false          0.88
Activations Density 0.359%