Explanations
references to tabloid news outlets and sensational stories
Negative Logits
Token            Logit
esser            -0.19
CBC              -0.15
olen             -0.15
olis             -0.14
UsersController  -0.14
Olymp            -0.14
ost              -0.14
oufl             -0.13
ะ                -0.13
ullen            -0.13
Positive Logits
Token            Logit
vio              0.18
#ad              0.16
iid              0.15
iw               0.14
wik              0.14
idity            0.14
lal              0.14
над              0.14
Lal              0.14
atchet           0.14
Activations Density: 0.018%