Explanations
references to information that is considered false or deceptive
references to "fake news."
Negative Logits
xual        -0.81
hens        -0.73
arching     -0.73
APTER       -0.71
azar        -0.71
}}}         -0.70
Discuss     -0.68
ands        -0.68
ires        -0.67
Reviewed    -0.66
Positive Logits
fake        0.86
²¾          0.83
pas         0.83
phony       0.72
bait        0.71
ument       0.70
Fake        0.70
reef        0.70
ulously     0.67
eln         0.67
Activations Density: 0.016%