Explanations
instances of dishonesty and deception
Negative Logits
sar               -0.16
illet             -0.16
оÑĢов              -0.15
lify              -0.15
tron              -0.15
chner             -0.15
.scalablytyped    -0.15
ILON              -0.14
408               -0.14
grily             -0.14
Positive Logits
told              0.24
about             0.20
uten              0.16
_about            0.16
-flat             0.16
sacks             0.16
detector          0.16
-about            0.16
berman            0.15
utenant           0.15
Activations Density: 0.024%