INDEX
Explanations
references to dishonesty and deception
Negative Logits
Token        Logit
Mu           -0.63
ClientSize   -0.61
A            -0.56
帖最后由     -0.56
“            -0.54
↵↵           -0.54
massimo      -0.53
a            -0.51
Gra          -0.51
Mu           -0.50
Positive Logits
Token        Logit
lie          1.45
liar         1.45
Lying        1.41
lies         1.36
lied         1.34
LIE          1.33
lying        1.31
liars        1.30
Liar         1.26
Lies         1.23
Activations Density: 0.194%