INDEX
Explanations
references to conspiracy theories and their proponents
Negative Logits
bourg       -0.88
pex         -0.84
emouth      -0.81
atal        -0.76
amen        -0.73
emale       -0.72
furt        -0.71
artney      -0.70
Delivery    -0.69
ophon       -0.68
Positive Logits
theories        1.14
theorists       0.92
theorist        0.91
debunk          0.90
concoct         0.86
conspiracy      0.84
theor           0.84
abound          0.81
explanations    0.81
twist           0.80
Activations Density 0.021%