INDEX
Explanations
concepts related to truth and deception in discourse
Negative Logits
zcze     -0.15
ville    -0.15
oni      -0.15
idis     -0.15
inc      -0.15
abr      -0.15
od       -0.14
ama      -0.14
اخت      -0.14
zl       -0.14
Positive Logits
reminded  0.18
ieder     0.17
uetype    0.17
ekk       0.16
ngth      0.16
oldem     0.16
issan     0.15
Exactly   0.15
anmar     0.15
remind    0.15
Activations Density: 0.006%