INDEX
Explanations
instances where someone is being accused of deception
instances of the word "lying" to indicate dishonesty or falsehood
Negative Logits
Token       Logit
Ultra       -0.77
ugal        -0.72
FN          -0.72
ORE         -0.72
entry       -0.71
era         -0.71
Effective   -0.71
aud         -0.71
ISO         -0.70
aldi        -0.70
Positive Logits
Token       Logit
lying       0.94
horizont    0.89
liar        0.80
lie         0.79
skelet      0.78
seiz        0.77
lied        0.77
pills       0.75
mortg       0.74
camoufl     0.73
Activations Density 0.006%
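
To give a sense of scale for the density figure, here is a minimal sketch. It assumes that "density" means the share of sampled token positions on which the feature has a nonzero activation; the page itself does not define the term. The function name `activation_density` and the sample data are hypothetical, chosen only so the arithmetic is easy to follow: 6 active positions out of 100,000 gives 0.006%.

```python
import numpy as np


def activation_density(activations: np.ndarray) -> float:
    """Return the fraction of positions with an activation above zero.

    Assumption: density = (number of active positions) / (total positions).
    """
    activations = np.asarray(activations)
    if activations.size == 0:
        raise ValueError("activations must not be empty")
    return float(np.count_nonzero(activations > 0) / activations.size)


# Hypothetical data: 100,000 token positions, 6 of which activate the feature.
acts = np.zeros(100_000)
acts[[10, 500, 2_000, 40_000, 70_000, 99_999]] = [0.5, 1.2, 0.3, 2.0, 0.9, 1.1]

density = activation_density(acts)
print(f"{density:.3%}")
```

Under this reading, a density of 0.006% means the feature fires on roughly 6 tokens in every 100,000, which fits a feature tied to a narrow set of words such as "lying", "liar", and "lied".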