INDEX
Explanations
incidents involving hate crimes or violence against marginalized groups
Negative Logits
…
-0.71
[…]
-0.63
….
-0.60
[…
-0.56
…↵↵
-0.50
[…]↵↵
-0.43
….
-0.43
…………
-0.41
……
-0.40
…..
-0.39
Positive Logits
...↵
0.69
,...↵
0.56
....↵
0.49
...↵↵
0.47
',...↵
0.42
...↵
0.40
..."↵
0.40
...
0.39
,...↵↵
0.38
...)↵
0.38
Activations Density 0.453%