Explanations
terms related to dishonesty and falsehoods
Negative Logits
mise          -0.17
ric           -0.17
shal          -0.17
ialized       -0.14
(att          -0.14
.generated    -0.14
mor           -0.14
scaling       -0.14
gaard         -0.14
ello          -0.13
Positive Logits
/false           0.24
ushima           0.16
ocrat            0.16
fulness          0.16
about            0.16
HostException    0.15
urous            0.15
iveness          0.15
ulence           0.15
itious           0.14
Activations Density: 0.058%