Explanations
terms related to misinformation or deceptive information
terms related to misleading or deceptive information
Negative Logits
Token       Logit
estones     -0.72
ahime       -0.72
erg         -0.72
emed        -0.71
ternal      -0.66
neighbor    -0.64
Beat        -0.63
github      -0.63
ero         -0.62
nder        -0.61
Positive Logits
Token           Logit
misleading      3.53
mislead         2.75
deceptive       2.64
misled          2.44
dece            2.29
deceive         2.05
misrepresent    1.98
deceived        1.91
deception       1.90
misinformation  1.73
Activations Density 0.034%
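
As a rough illustration of how a figure like 0.034% could be read, the sketch below assumes that activation density means the share of tokens on which the feature's activation is above zero, expressed as a percentage. That definition, the function name `activation_density`, and the sample values are assumptions made for illustration; none of them come from the dashboard itself.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of activations strictly above `threshold`.

    Assumption: "density" is the fraction of tokens where the feature fires.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical sample: 10,000 tokens, of which 3 activate the feature.
sample = [0.0] * 9997 + [1.2, 0.4, 3.1]
print(f"{activation_density(sample):.3f}%")
```

Under this reading, a density of 0.034% would mean the feature fires on roughly 3 or 4 tokens out of every 10,000, which fits a feature tied to a narrow set of words such as those in the positive logits list.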