Explanations
instances of threats and intimidation directed towards individuals or groups
Negative Logits
gorit       -0.17
itech       -0.16
овар        -0.14
ocker       -0.14
emo         -0.14
inha        -0.14
inflicted   -0.13
ocyte       -0.13
FAST        -0.13
.React      -0.13
Positive Logits
threats        0.33
intimid        0.25
threat         0.25
intimidation   0.25
targeted       0.24
Threat         0.24
-threat        0.23
safety         0.23
harassment     0.23
threatened     0.22
Activations Density 0.154%