INDEX
Explanations
mentions of lying or related terms
expressions related to dishonesty and deceit
Negative Logits
aldi       -0.85
ugal       -0.76
joining    -0.76
orsi       -0.75
allows     -0.74
FN         -0.72
ains       -0.69
illed      -0.69
obs        -0.69
ategory    -0.69
Positive Logits
detector        1.01
uten            0.94
detectors       0.77
deceit          0.76
utenant         0.76
vulner          0.74
liar            0.74
misrepresent    0.73
acies           0.73
deceive         0.72
Activations Density 0.021%
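
As a rough illustration of how a density figure like the one above could be read, the sketch below assumes it means the percentage of token positions on which the feature fires at all. The page does not define it, so this reading is an assumption. The function name `activation_density`, the `threshold` parameter, and the toy data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions on which the
    feature is active; the source page does not spell out its definition.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical toy data: 10,000 token positions, 2 of which activate.
toy = [0.0] * 9998 + [1.3, 0.4]
print(f"{activation_density(toy):.3f}%")
```

Under this reading, a density of 0.021% would correspond to the feature firing on roughly 2 out of every 10,000 tokens.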