Explanations
mentions of the word "anti" followed by a subsequent word
references to anti-Semitic sentiments or themes
Negative Logits
Token        Logit
mable        -0.81
ccording     -0.77
tremend      -0.76
manship      -0.72
doms         -0.72
matically    -0.71
corrid       -0.67
rall         -0.65
skelet       -0.65
rul          -0.64
Positive Logits
Token        Logit
anti         1.09
ucci         0.92
Devi         0.86
iso          0.86
opsis        0.85
zona         0.84
qua          0.82
pora         0.78
oco          0.78
ctr          0.77
Activations Density: 0.013%
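
The density figure is not defined on this page. A common reading, assumed here, is the fraction of sampled token positions on which the feature fires at all. The sketch below illustrates that reading only; the function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and do not come from the source.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of sampled tokens on which the
    feature is active; the page itself does not define the term.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: the feature fires on 13 of 100,000 token positions.
sample = [0.0] * 99_987 + [1.5] * 13
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.013%
```

Under this reading, a density of 0.013% means the feature is active on roughly 13 of every 100,000 tokens, which would make it a very sparse feature.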