Explanations
words or phrases that indicate instances of deception or betrayal
Negative Logits
Token       Logit
362         -0.15
Strauss     -0.14
274         -0.14
Vend        -0.14
로드        -0.14
enth        -0.14
eba         -0.14
minority    -0.13
illis       -0.13
174         -0.13
Positive Logits
Token       Logit
sh          0.25
enan        0.17
INCIDENT    0.16
alink       0.16
ETHER       0.16
.sh         0.16
vier        0.15
abby        0.15
sh          0.15
ort         0.15
Activations Density 0.024%