INDEX
Explanations
references to "fake news" and discussions about media trustworthiness
Negative Logits
ollar    -0.21
ollo     -0.16
utzer    -0.16
tick     -0.15
ullan    -0.15
uze      -0.15
Toll     -0.15
rame     -0.15
athom    -0.14
toll     -0.14
Positive Logits
é³       0.14
æŃ©      0.14
290      0.14
uiltin   0.14
Kral     0.14
pec      0.13
igkeit   0.13
Amp      0.13
Qed      0.13
unga     0.13
Activations Density 0.059%