Explanations
disinformation, misinformation, fake news
Negative Logits
Token         Logit
촬영            0.44
Pain          0.42
introd        0.41
穷             0.41
Stim          0.41
ছুই           0.41
adventurer    0.41
algèbre       0.39
queleto       0.39
🌃            0.39
Positive Logits
Token             Logit
disinformation    1.78
misinformation    1.72
fake              1.34
propaganda        1.34
Fake              1.25
Fake              1.22
falsehood         1.20
fake              1.14
propagand         1.13
Propaganda        1.10
Activations Density 0.024%