INDEX
Explanations
expressions and instances related to hate speech and its consequences
Head Attr Weights
0:0.01
1:0.01
2:0.09
3:0.05
4:0.05
5:0.03
6:0.33
7:0.06
8:0.03
9:0.03
10:0.17
11:0.09
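
The head attribution weights above can be ranked to see which attention heads contribute most to this feature. The following is a minimal sketch, not the dashboard's own tooling: the names `head_weights` and `top_heads` are hypothetical, and the values are copied from the list above.

```python
# Head attribution weights copied from the listing above (head index -> weight).
# The variable and function names are illustrative assumptions.
head_weights = {
    0: 0.01, 1: 0.01, 2: 0.09, 3: 0.05, 4: 0.05, 5: 0.03,
    6: 0.33, 7: 0.06, 8: 0.03, 9: 0.03, 10: 0.17, 11: 0.09,
}


def top_heads(weights, k=2):
    """Return the k (head, weight) pairs with the largest weights."""
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Sum of the displayed weights; useful as a sanity check on the listing.
total = sum(head_weights.values())

for head, weight in top_heads(head_weights):
    print(f"head {head}: weight {weight:.2f}, share of listed total {weight / total:.1%}")
```

Heads 6 and 10 carry the largest weights. The displayed values sum to 0.95 rather than 1.00, possibly because each weight is rounded to two decimals; the source does not say.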
Negative Logits
将:-1.46
*/(:-1.43
生:-1.42
pole:-1.31
NEY:-1.28
baseman:-1.28
版:-1.28
ESA:-1.26
itte:-1.24
aple:-1.24
Positive Logits
intimidation:1.47
terrorism:1.31
itives:1.21
hate:1.20
terror:1.19
upload:1.17
illance:1.16
Cthulhu:1.16
vandalism:1.16
bullying:1.15
Activations Density 0.005%