INDEX
Explanations
references to hate groups and their leaders
Negative Logits
Fuse      -0.17
fection   -0.16
.Sdk      -0.15
765       -0.15
Mul       -0.14
421       -0.14
Fuse      -0.14
defe      -0.14
oppel     -0.14
ä»ĭ       -0.14
Positive Logits
racist           0.25
rac              0.23
Ku               0.22
Charlottesville  0.22
KK               0.21
-Nazi            0.21
Klan             0.21
Hitler           0.21
supremacist      0.21
rac              0.20
Activations Density: 0.201%