Explanations
references to deception or dishonesty
Negative Logits
Token             Logit
оÑĢов             -0.16
-blind            -0.16
sar               -0.16
íģ¼               -0.16
.scalablytyped    -0.15
tron              -0.15
ousse             -0.14
blind             -0.14
meer              -0.14
blind             -0.14
Positive Logits
Token             Logit
uten              0.24
utenant           0.20
about             0.18
_about            0.17
detector          0.17
-flat             0.17
chten             0.17
urance            0.16
Lie               0.16
Detector          0.16
Activations Density: 0.016%