INDEX
Explanations
conspiratorial and politically charged language
Negative Logits
Reincarn     -0.74
Seym         -0.74
limb         -0.73
Stras        -0.73
condem       -0.72
tabloid      -0.72
Tanz         -0.70
Vaugh        -0.69
Citiz        -0.68
landmarks    -0.68
Positive Logits
ï¸ı          1.25
lean         1.01
company      0.97
vernment     0.95
Balt         0.93
lime         0.93
ever         0.93
wow          0.92
agree        0.91
pol          0.90
Activations Density 6.015%