INDEX
Explanations
phrases indicating deception or disguising intentions
NEGATIVE LOGITS
'options    -0.17
mort        -0.15
omik        -0.14
agnostic    -0.14
.mods       -0.14
Öğren       -0.14
ofile       -0.14
atoms       -0.14
골          -0.14
.lift       -0.14
POSITIVE LOGITS
excuses        0.19
excuse         0.18
icht           0.17
justification  0.17
justify        0.17
ảng            0.16
901            0.16
arger          0.15
ady            0.15
claimed        0.14
Activations Density 0.230%
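
As a rough illustration of how a density figure like the one above could be read, the sketch below assumes that "Activations Density" means the fraction of token positions on which the feature has a non-zero activation. That definition, the function name `activation_density`, and the toy counts (chosen only so the result prints as the same percentage) are assumptions for illustration and are not taken from the dashboard's data.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumed definition: density = (# positions with activation > threshold) / (# positions).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Toy data (hypothetical): 10,000 token positions, 23 of which activate the feature.
toy_activations = [0.0] * 9977 + [1.5] * 23

density = activation_density(toy_activations)
print(f"Activations Density {density:.3%}")
```

Under this reading, a density well below 1% indicates a sparse feature that fires on only a small share of tokens.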