Explanations
references to white supremacist organizations and individuals
Negative logits
Token            Logit
Fuse             -0.16
Strip            -0.15
hostages         -0.15
fection          -0.15
Orm              -0.15
765              -0.14
hostage          -0.14
Fuse             -0.14
bourgeois        -0.14
oppel            -0.13
Positive logits
Token            Logit
Ku               0.28
neo              0.25
-Nazi            0.24
Klan             0.24
white            0.23
Neo              0.22
Charlottesville  0.22
Neo              0.21
Odin             0.21
supremacist      0.21
Activation density: 0.105%