INDEX
Explanations
mentions of physical removal or alteration in a social or political context
Neuron Alignment
Index   Value   % of L₁
1392    +0.12   0.4%
874     +0.10   0.3%
61      +0.10   0.3%
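
As a rough illustration of how a table like the one above could be produced, the sketch below assumes that Value is the feature's weight on a given neuron and that % of L₁ is the absolute value of that weight divided by the L₁ norm of the whole weight vector. Both readings are assumptions rather than something the dashboard states, and the function name `neuron_alignment` and the toy vector are hypothetical.

```python
def neuron_alignment(weights, top_k=3):
    """Return (index, value, share of L1 norm) for the top_k largest-magnitude weights.

    Assumption: "% of L1" means |w_i| / sum_j |w_j|.
    """
    l1 = sum(abs(w) for w in weights)
    if l1 == 0:
        return []
    # Rank neuron indices by absolute weight, largest first.
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    return [(i, weights[i], abs(weights[i]) / l1) for i in ranked[:top_k]]


# Toy vector for illustration only; these are not values from this feature.
toy = [0.5, -0.25, 0.0, 0.25]
for idx, value, share in neuron_alignment(toy, top_k=2):
    print(f"{idx}\t{value:+.2f}\t{share:.1%}")
```

Under this reading, small shares such as 0.3% to 0.4% would mean the weight is spread across many neurons rather than concentrated on a few.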
Correlated Neurons
Index   P. Corr.   Cos Sim.
124     +0.12      0.03
1392    +0.10      0.04
61      +0.10      0.03
Negative Logits
impractica   -0.89
reluct       -0.86
disagre      -0.85
affor        -0.84
impra        -0.82
perfet       -0.81
excru        -0.81
scrat        -0.79
uninten      -0.79
Wtf          -0.78
Positive Logits
removal      1.06
remove       1.04
removed      1.02
remove       1.02
removes      1.00
Remove       0.98
Remove       0.97
Removal      0.94
removed      0.94
removing     0.90
Activations Density 0.148%