Explanations
phrases with negative connotations or controversial topics
Neuron Alignment
Index   Value   % of L₁
201     +0.15   0.7%
680     +0.12   0.5%
629     +0.12   0.5%
Correlated Neurons
Index   P. Corr.   Cos Sim.
201     +0.15      0.02
1174    +0.12      0.02
981     +0.12      0.03
Negative Logits
Token        Logit
Iglesia      -0.50
święta       -0.43
Sánchez      -0.43
Williams     -0.42
száll        -0.42
classnames   -0.41
RELIGION     -0.41
Петра        -0.40
Rivers       -0.40
vertelt      -0.40
Positive Logits
Token     Logit
Fo        1.18
Fo        1.14
FO        1.05
fo        0.97
fo        0.97
FO        0.87
Foil      0.87
foams     0.84
FOG       0.83
foaming   0.82
Activations Density: 0.138%