Explanations
references to political figures and issues related to falsehoods in media
Negative Logits
Telegram    -0.18
Telegram    -0.16
202         -0.16
ĨĴ          -0.15
arl         -0.15
masks       -0.15
deg         -0.14
gal         -0.14
ï¿          -0.14
747         -0.14
Positive Logits
icus        0.18
ãĥ¼ãĥª      0.17
prites      0.16
Uvs         0.16
dued        0.15
Meadow      0.15
nicos       0.14
htar        0.14
icum        0.14
apiro       0.14
Activations Density: 0.263%