Explanations
words related to deception and dishonesty
Negative Logits
Token        Logit
asco         -0.18
ustin        -0.15
ções         -0.15
breakout     -0.15
Amph         -0.15
Interactive  -0.14
loo          -0.14
Tar          -0.14
jn           -0.14
Stick        -0.14
Positive Logits
Token        Logit
base         0.18
grading      0.18
human        0.18
kul          0.17
adece        0.16
Base         0.16
plr          0.15
BASE         0.15
omon         0.15
prec         0.15
Activations Density: 0.029%