INDEX
Explanations
mentions of fake news
references to fake news
Negative Logits
xual        -0.85
Marginal    -0.73
pai         -0.73
hem         -0.72
endez       -0.69
night       -0.69
Reviewer    -0.68
served      -0.67
waukee      -0.66
sqor        -0.66
Positive Logits
news        1.00
IDs         0.96
ument       0.96
NEWS        0.87
identities  0.81
pas         0.80
positives   0.80
tails       0.80
News        0.78
news        0.73
Activations Density 0.074%