Explanations
References to anti-government or anti-establishment sentiment
Negative Logits
stead               -0.74
è¯                  -0.71
hips                -0.70
ourced              -0.69
bos                 -0.66
therap              -0.65
theless             -0.64
anyl                -0.64
antically           -0.63
externalToEVAOnly   -0.63
Positive Logits
Fa                  0.90
strate              0.80
zac                 0.77
war                 0.77
hero                0.75
hesis               0.73
Monitor             0.72
Dhabi               0.71
Age                 0.69
abuse               0.69
Activations Density 0.006%