INDEX
Explanations
confidently fabricated information
Negative Logits
描写  0.66
داری  0.66
ণিজ্য  0.65
ጾ  0.65
Window  0.65
submenu  0.64
Dom  0.64
型  0.63
Interrupt  0.63
programming  0.63
Positive Logits
falsehood  2.08
disinformation  1.89
debunk  1.78
hoax  1.75
credibility  1.74
거짓  1.72
skepticism  1.72
disbelief  1.72
misinformation  1.72
myths  1.68
Activations Density 0.378%