Explanations
mentions of fake news and related terms
references to "fake news" and misinformation
Negative Logits
aird        -0.75
dues        -0.75
illes       -0.72
ktop        -0.71
atri        -0.70
onding      -0.68
foreseen    -0.68
airo        -0.67
anse        -0.67
emale       -0.66
Positive Logits
ument            1.11
disinformation   1.02
misinformation   1.01
propag           1.00
perpetrated      1.00
pedd             0.98
concoct          0.97
falsehood        0.97
nonsense         0.95
debunked         0.93
Activation Density: 0.197%