INDEX
Explanations
references to persecution or mistreatment
Negative Logits
inho         -0.17
oids         -0.15
ucle         -0.15
itest        -0.15
oid          -0.14
iad          -0.14
ural         -0.14
ald          -0.14
sophistic    -0.14
atura        -0.14
Positive Logits
by           0.19
ë°Ľ          0.17
dorf         0.16
تÙĪØ³Ø·     0.15
oleh         0.15
ress         0.15
undi         0.15
227          0.15
applied      0.14
inator       0.14
Activations Density: 0.263%