INDEX
Explanations
words related to social or political issues, particularly culture, politics, and legislation
Neuron Alignment
Index   Value   % of L₁
1678    +0.14   0.4%
32      +0.12   0.4%
478     +0.10   0.3%
Correlated Neurons
Index   P. Corr.   Cos Sim.
32      +0.14      0.11
1678    +0.12      0.10
68      +0.10      0.08
Negative Logits
Denote      -0.55
mirador     -0.54
fortæ       -0.52
ilangkan    -0.49
bahaya      -0.48
lcm         -0.48
barcel      -0.47
revisa      -0.47
uklu        -0.46
ropshire    -0.46
Positive Logits
been        0.71
tats        0.68
persino     0.68
zyn         0.65
stockholm   0.65
blos        0.64
ridu        0.64
vry         0.64
BEEN        0.64
affez       0.63
Activations Density 0.479%