INDEX
Explanations
words related to war, conflict, and distortion
Neuron Alignment
Index   Value   % of L₁
2016    +0.10   0.3%
1233    +0.09   0.3%
1077    +0.09   0.3%
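The "% of L₁" column is not defined on this page. One plausible reading, assumed here rather than confirmed by the source, is that it gives each neuron's absolute value as a share of the L₁ norm of the full vector. A minimal sketch under that assumption follows; the helper name `percent_of_l1` and the toy vector are hypothetical and are not taken from the dashboard.

```python
def percent_of_l1(weights):
    """Return each entry's absolute value as a percentage of the vector's L1 norm.

    Assumption: this is what the "% of L1" column means; the page does not say.
    """
    l1 = sum(abs(w) for w in weights)
    if l1 == 0:
        raise ValueError("L1 norm is zero, so shares are undefined")
    return [100.0 * abs(w) / l1 for w in weights]


# Toy vector for illustration only (not data from the dashboard).
toy = [0.5, -0.25, 0.25]
shares = percent_of_l1(toy)
```

If this reading is right, a value of +0.10 shown as 0.3% would suggest an L₁ norm on the order of 30 for the full vector, although rounding in the displayed figures makes that only a rough estimate.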
Correlated Neurons
Index   P. Corr.   Cos Sim.
1363    +0.10      0.03
1233    +0.09      0.02
2016    +0.09      0.03
Negative Logits
republi    -0.70
praktik    -0.66
kandid     -0.64
pól        -0.64
gius       -0.60
Kün        -0.60
trö        -0.59
biograf    -0.58
granat     -0.58
Städ       -0.57
Positive Logits
twist        0.85
bend         0.82
bending      0.80
bends        0.79
twisting     0.79
twisted      0.79
bent         0.78
distorted    0.77
twists       0.76
distortion   0.74
Activations Density: 0.143%
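The page does not say how the density figure is computed. A common convention, assumed here, is the fraction of sampled tokens on which the feature's activation is above zero. The sketch below illustrates that convention; the function name `activation_density`, the `threshold` parameter, and the toy activations are hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation exceeds `threshold`.

    Assumption: density means "share of tokens where the feature fires";
    the dashboard itself does not define the term.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Toy activations for illustration only: the feature fires on 1 of 4 tokens.
toy = [0.0, 0.0, 1.2, 0.0]
density = activation_density(toy)
```

Under this convention, a density of 0.143% would correspond to the feature firing on roughly 1 in 700 tokens.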