Explanations
threatening language or mentions of threats
mentions of threats to safety or security
Negative Logits
Token     Logit
igs       -0.72
Cups      -0.70
OME       -0.70
neys      -0.68
neau      -0.67
Stores    -0.66
Gins      -0.65
å¤        -0.64
abee      -0.63
abeth     -0.63
Positive Logits
Token         Logit
threat        3.75
threats       2.86
threat        2.85
Threat        2.52
menace        2.35
danger        1.90
threatening   1.80
threaten      1.79
threatened    1.67
risk          1.44
Activations Density 0.018%