Explanations
phrases related to false information or deceit
references to fake news
Negative Logits
xual                 -0.95
hem                  -0.75
night                -0.73
served               -0.73
riott                -0.71
onential             -0.69
onen                 -0.66
pai                  -0.65
guiActiveUnfocused   -0.64
interrupted          -0.64
Positive Logits
ument                1.02
news                 0.94
IDs                  0.87
pas                  0.87
positives            0.81
ulent                0.79
NEWS                 0.73
identities           0.70
ulence               0.70
outs                 0.67
Activations Density: 0.055%