INDEX
Explanations
phrases related to criticizing or expressing disappointment
Neuron Alignment
Index    Value    % of L₁
1823     +0.10    0.3%
872      +0.08    0.2%
1935     +0.07    0.2%
Correlated Neurons
Index    P. Corr.    Cos Sim.
1823     +0.10       0.03
1890     +0.08       0.04
1543     +0.07       0.03
Negative Logits
philanth      -1.11
fortn         -1.08
encomp        -1.07
shenan        -1.05
Immig         -1.05
volunte       -1.04
impractica    -1.02
resear        -1.01
reluct        -1.00
increa        -0.99
Positive Logits
original      0.76
original      0.75
Original      0.72
<bos>         0.69
Original      0.67
originales    0.65
ORIGINAL      0.62
basics        0.62
integrity     0.61
gainera       0.61
Activations Density 0.430%