INDEX
Explanations
phrases related to conspiracy theories and extremist ideologies
Neuron Alignment
Index    Value    % of L₁
1741     +0.27    1.0%
50       +0.23    0.8%
2019     +0.16    0.6%
Correlated Neurons
Index    P. Corr.    Cos Sim.
16       +0.27       0.08
50       +0.23       0.06
1287     +0.16       0.04
Negative Logits
shouldBe    -0.62
įsi         -0.59
rayures     -0.57
apsau       -0.54
gardien     -0.52
attaques    -0.51
nė          -0.51
spreken     -0.51
naudoti     -0.51
virš        -0.51
Positive Logits
magis     0.84
tamen     0.77
palab     0.75
Läh       0.74
Mitä      0.74
mef       0.73
ibi       0.71
reputa    0.70
Sklici    0.70
priva     0.69
Activations Density 0.345%