Explanations
references to white supremacist ideologies and groups
Negative Logits
ronic     -0.17
utr       -0.17
rug       -0.15
nette     -0.14
iro       -0.14
opp       -0.14
argest    -0.14
opping    -0.14
inue      -0.14
inde      -0.14
Positive Logits
groups           0.28
Groups           0.23
-groups          0.20
groups           0.19
Groups           0.19
(groups          0.18
_groups          0.18
organizations    0.17
group            0.16
Odin             0.16
Activations Density: 0.038%
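
To give a sense of scale for the density figure, here is a minimal sketch. It assumes that density means the fraction of sampled token positions on which the feature's activation is above zero. That is a common convention for sparse-autoencoder dashboards, but this page does not confirm it. The function name `activation_density`, the `threshold` parameter, and the synthetic activations are all hypothetical and used only for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of token positions whose activation exceeds `threshold`.

    Assumed definition (not confirmed by the page): active positions / total positions.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Synthetic data, not taken from the page: 38 active positions out of 100,000.
acts = [0.0] * 100_000
for i in range(38):
    acts[i * 1000] = 1.5  # arbitrary positive activation value

density = activation_density(acts)
print(f"{density:.3%}")  # prints 0.038%
```

Under this assumed definition, a density of 0.038% corresponds to the feature firing on roughly 38 of every 100,000 token positions.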