INDEX
Explanations
statements related to power dynamics, discrimination, and conspiracy theories
Neuron Alignment
Index    Value    % of L₁
1842     +0.23    0.7%
1343     +0.14    0.4%
198      +0.12    0.4%
Correlated Neurons
Index    P. Corr.    Cos Sim.
1842     +0.23       0.08
198      +0.14       0.07
1261     +0.12       0.02
Negative Logits
<bos>               -1.53
rungsseite          -0.60
autorytatywna       -0.54
Normdatei           -0.54
انجليز              -0.52
Debido              -0.52
HFILL               -0.52
IVEREF              -0.51
disambiguazione     -0.51
webElementXpaths    -0.49
Positive Logits
Abbé         0.89
ordina       0.88
Ordre        0.85
carrefour    0.83
ecclesias    0.81
ivi          0.81
Ottobre      0.80
Aéroport     0.79
Confe        0.79
Bibl         0.78
Activations Density: 0.844%