Explanations
phrases that imply deception or manipulation in communication
Negative Logits
Token      Logit
opsis      -0.16
boz        -0.16
roz        -0.15
ellas      -0.15
uen        -0.15
lem        -0.14
uptools    -0.14
ument      -0.14
ipa        -0.14
Comm       -0.14
Positive Logits
Token      Logit
etti       0.17
меж        0.15
Alv        0.15
-Ta        0.14
oders      0.14
aghan      0.14
.getP      0.14
λικά       0.14
ees        0.13
Ỽp         0.13
Activations Density: 0.206%