INDEX
Explanations
phrases related to criticism and attacks towards individuals or groups
aggressive language or terms associated with criticism and attacks
Negative Logits
pection     -0.76
hazard      -0.72
cano        -0.72
duct        -0.70
poral       -0.70
lycer       -0.70
yip         -0.69
earchers    -0.69
trap        -0.69
kj          -0.69
Positive Logits
critics              1.01
liberals             0.93
feminists            0.93
commenters           0.91
fellow               0.90
Republicans          0.90
politicians          0.89
Islam                0.88
environmentalists    0.88
gays                 0.87
Activations Density: 0.220%