INDEX
Explanations
phrases related to discussing the pros and cons of a topic
Neuron Alignment
Index   Value   % of L₁
203     +0.09   0.3%
1675    +0.09   0.2%
80      +0.07   0.2%
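One way to read the "% of L₁" column is as each neuron's share of the L1 norm of the feature's weight vector. The sketch below works under that assumption only. The helper name `l1_share` is hypothetical, and the toy vector is made-up illustration data, not values taken from this feature.

```python
def l1_share(weights):
    """Return each entry's absolute value as a fraction of the vector's L1 norm."""
    l1 = sum(abs(w) for w in weights)
    if l1 == 0:
        raise ValueError("L1 norm is zero; shares are undefined")
    return [abs(w) / l1 for w in weights]


# Toy vector for illustration only (not data from this page).
toy_weights = [0.5, -0.25, 0.25]
shares = l1_share(toy_weights)
```

Under this reading, small percentages such as those listed here would mean the feature's weight is spread over many neurons rather than concentrated in a few.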
Correlated Neurons
Index   P. Corr.   Cos Sim.
1675    +0.09      0.05
990     +0.09      0.03
852     +0.07      0.03
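Assuming "P. Corr." stands for Pearson correlation and "Cos Sim." for cosine similarity, the two columns can be computed as sketched below. Which quantities the dashboard actually compares (for example activations for the correlation and weight vectors for the similarity) is an assumption here. The function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Pearson correlation subtracts the mean before comparing, while cosine similarity does not, so the two columns can differ even when computed on the same pair of vectors.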
Negative Logits
voleva     -0.81
faceva     -0.77
vogli      -0.73
dichi      -0.69
aspetta    -0.68
intende    -0.67
lavora     -0.67
sape       -0.66
sappi      -0.65
dimenti    -0.64
Positive Logits
benefits         0.81
advantages       0.79
Disadvantages    0.77
benefits         0.77
Benefits         0.76
disadvantages    0.73
Advantages       0.73
Benefits         0.71
downsides        0.70
drawbacks        0.69
Activation Density: 0.284%
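The density figure can be read as the fraction of sampled tokens on which the feature fires. A minimal sketch under that assumption follows. The helper name `activation_density` and the zero threshold are hypothetical choices, and the sample values are invented for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of activation values strictly above the threshold."""
    if not activations:
        raise ValueError("need at least one activation value")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Toy sample for illustration only: the feature fires on 1 of 4 tokens.
toy_density = activation_density([0.0, 0.0, 0.5, 0.0])
```

On this reading, a density of 0.284% would correspond to the feature being active on roughly 3 of every 1,000 tokens.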