INDEX
Explanations
words related to deception or dishonesty
Negative Logits
interrupted  -0.82
nas          -0.72
IO           -0.69
ENTS         -0.69
LESS         -0.67
YL           -0.67
ians         -0.67
cake         -0.66
IAN          -0.65
upon         -0.65
Positive Logits
azy          1.28
eker         1.08
igh          1.05
uth          0.97
eper         0.93
asure        0.90
avement      0.89
pload        0.87
aving        0.85
aping        0.84
Activations Density 0.019%