INDEX
Explanations
mentions of threats, especially those related to violence or national security
mentions of threats, particularly those related to violence and national security
Negative Logits
mys       -0.82
ilts      -0.79
arist     -0.76
orph      -0.70
porter    -0.68
Band      -0.67
otide     -0.66
baum      -0.66
NAS       -0.66
puted     -0.65
Positive Logits
threats       1.32
threat        1.06
undermin      0.96
threat        0.93
challeng      0.92
Threat        0.89
threaten      0.88
proble        0.88
threatening   0.83
menacing      0.83
Activations Density 0.009%
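
To put the density figure in context, here is a minimal sketch of how such a number could be computed. It assumes that activation density means the fraction of sampled tokens on which the feature has a nonzero activation. The page does not confirm that definition, and the function name `activation_density`, its `threshold` parameter, and the sample data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of activations strictly above the threshold.

    Assumption: "density" is the share of tokens on which the feature fires.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: the feature fires on 9 of 100,000 tokens.
sample = [0.0] * 99_991 + [1.5] * 9
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.009%
```

Under that assumption, a density of 0.009% means the feature fires on roughly 9 tokens in every 100,000, which fits a feature that responds only to a narrow family of threat-related words.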