Explanations
mentions of or references to fake information or news
references to "fake news"
Negative Logits
xual      -0.87
ires      -0.75
ands      -0.75
arching   -0.73
}}}       -0.70
APTER     -0.70
anded     -0.69
draw      -0.68
Thom      -0.68
azar      -0.66
Positive Logits
pas           0.80
fake          0.79
²¾            0.76
ument         0.74
Fake          0.72
ãĥ¼ãĥĨãĤ£     0.68
eln           0.68
ulously       0.67
monster       0.67
outs          0.67
Activations Density 0.022%