INDEX
Explanations
references to individuals or groups identified as victims
Negative Logits
ings      -0.18
erie      -0.17
enta      -0.16
egot      -0.15
oes       -0.15
mates     -0.15
ØŃÙĬ      -0.15
bons      -0.15
é¢ĺ       -0.15
ROS       -0.15
Positive Logits
hood      0.25
ized      0.25
/target   0.20
atically  0.19
ively     0.19
izers     0.19
ology     0.19
ization   0.17
IZED      0.17
ised      0.16
Activations Density: 0.024%