INDEX
Explanations
statements or claims made by individuals
Negative Logits
Token     Logit
annon     -0.16
maj       -0.15
ायà¤ķ     -0.15
him       -0.15
ckt       -0.14
andler    -0.14
emi       -0.14
Samar     -0.14
tl        -0.14
iesen     -0.14
Positive Logits
Token     Logit
rive      0.16
Lance     0.14
uces      0.14
znik      0.14
inda      0.13
whole     0.13
llu       0.13
oust      0.13
svc       0.13
ction     0.13
Activations Density: 0.040%