Explanations
phrases and terms related to falsehoods and misinformation
Negative Logits
Token       Logit
ialized     -0.16
ikk         -0.16
lesi        -0.15
Bias        -0.14
igel        -0.14
adiens      -0.14
igs         -0.14
íĻ©         -0.14
tones       -0.14
ุà¸ķ        -0.14
Positive Logits
Token       Logit
/false      0.16
ocrat       0.16
claim       0.16
premises    0.15
.localized  0.14
aina        0.14
HD          0.14
ktop        0.14
claims      0.14
омеÑĢ       0.14
Activations Density: 0.089%