Explanations
terms related to deception or falsehoods
Negative Logits
Token         Logit
weeney        -0.81
undai         -0.80
accompan      -0.80
odes          -0.79
icts          -0.77
laws          -0.77
ses           -0.76
izont         -0.74
ippers        -0.73
ographics     -0.73
Positive Logits
Token         Logit
concoct       1.13
invented      1.02
perpetrated   0.91
unworthy      0.90
mir           0.86
excuse        0.83
gimmick       0.82
because       0.81
devoid        0.81
conceived     0.80
Activations Density 0.072%
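
The logit lists and the density figure above could be derived from raw numbers roughly as follows. This is a minimal sketch, not the dashboard's actual method: it assumes the per-token logit effects are already available as a mapping, and that activation density means the fraction of token positions where the feature's activation is strictly positive. The names `top_logits` and `activation_density` are hypothetical, and the activation list is made-up illustrative data chosen to give 0.072%.

```python
from typing import Dict, List, Sequence, Tuple

TokenLogit = Tuple[str, float]


def top_logits(effects: Dict[str, float], k: int = 10) -> Tuple[List[TokenLogit], List[TokenLogit]]:
    """Return the k most negative and k most positive tokens by logit effect."""
    ranked = sorted(effects.items(), key=lambda kv: kv[1])
    negative = ranked[:k]        # most negative first
    positive = ranked[::-1][:k]  # most positive first
    return negative, positive


def activation_density(activations: Sequence[float]) -> float:
    """Fraction of positions with a strictly positive activation (assumed definition)."""
    if not activations:
        return 0.0
    return sum(1 for a in activations if a > 0) / len(activations)


# A subset of the token/logit pairs listed in this document.
effects = {
    "concoct": 1.13,
    "invented": 1.02,
    "perpetrated": 0.91,
    "weeney": -0.81,
    "undai": -0.80,
    "laws": -0.77,
}
neg, pos = top_logits(effects, k=2)

# Made-up activations: 72 active positions out of 100,000.
activations = [0.0] * 99_928 + [1.5] * 72
density = activation_density(activations)
print(f"Activations Density {density:.3%}")
```

Sorting once and slicing from both ends keeps the two lists consistent with each other; a real pipeline would obtain `effects` and `activations` from the model rather than from literals.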