Explanations
terms and phrases related to deception or dishonesty
Negative Logits
ê´           -0.16
ylko         -0.15
Singleton    -0.14
Pazar        -0.14
bek          -0.14
ء            -0.14
γκε          -0.14
ê´           -0.14
禮           -0.13
atar         -0.13
Positive Logits
968           0.15
rah           0.15
ukt           0.14
olis          0.14
arto          0.14
chers         0.14
Tre           0.14
ARCH          0.13
Fi            0.13
duplic        0.13
Activations Density 0.033%