Explanations
words related to deception or false information
mentions of hoaxes and pranks
Negative Logits
Token      Logit
bourg      -0.68
Borders    -0.68
aws        -0.66
ailable    -0.65
uv         -0.65
uner       -0.62
udeau      -0.62
bilt       -0.62
oyal       -0.61
asper      -0.60
Positive Logits
Token      Logit
hoax       1.03
sters      0.87
²¾         0.84
erella     0.81
ually      0.80
ishly      0.79
edly       0.79
es         0.78
ulence     0.75
ed         0.75
Activations Density: 0.029%