INDEX
Explanations
mention of terrorist activities or groups
references to terrorism
Negative Logits
endum        -0.77
galitarian   -0.74
dit          -0.68
bye          -0.66
Quartz       -0.66
resso        -0.65
flush        -0.63
Grape        -0.63
laus         -0.62
Salt         -0.62
Positive Logits
fully        0.98
abad         0.95
attacks      0.88
spree        0.87
efully       0.86
istan        0.86
raids        0.83
fulness      0.82
istani       0.82
bombing      0.81
Activations Density 0.011%