Explanations
phrases related to misinformation, deception, and false information
Negative Logits
Token               Logit
hens                -0.75
anse                -0.74
foreseen            -0.72
arya                -0.68
area                -0.68
aldo                -0.67
foundation          -0.67
guiActiveUnfocused  -0.67
illes               -0.67
winner              -0.67
Positive Logits
Token               Logit
ument               1.05
ulent               0.96
ulence              0.90
falsely             0.88
excuse              0.87
excuses             0.84
pretext             0.82
concoct             0.77
pas                 0.75
false               0.74
Activations Density: 1.225%