INDEX
Explanations
words related to consistency or typicality
Neuron Alignment
Index   Value   % of L₁
605     +0.09   0.2%
2030    +0.08   0.2%
1307    +0.07   0.2%
Correlated Neurons
Index   P. Corr.   Cos Sim.
2030    +0.09      0.04
361     +0.08      0.04
420     +0.07      0.05
Negative Logits
Token     Logit
solidar   -0.89
ideolog   -0.88
kram      -0.88
erd       -0.83
robus     -0.83
makro     -0.82
utop      -0.79
ohr       -0.79
paus      -0.77
gesta     -0.77
Positive Logits
Token          Logit
paradiso       0.64
fidèles        0.64
chrétiens      0.62
dovr           0.61
compagn        0.60
fameux         0.57
appartamento   0.57
prêtres        0.57
tempio         0.57
bénéfice       0.56
Activations Density: 0.285%