Explanations
phrases expressing strong opinions or stances
Neuron Alignment
Index   Value   % of L₁
599     +0.11   0.3%
766     +0.10   0.3%
1919    +0.08   0.2%
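The "% of L₁" column can plausibly be read as each neuron's share of the vector's total absolute weight. The sketch below assumes that definition (absolute weight divided by the L₁ norm of the full weight vector), which this page does not state. The function name `l1_share` and the toy weights are hypothetical and are not taken from the feature shown here.

```python
def l1_share(weights, index):
    """Return |weights[index]| as a percentage of the L1 norm of `weights`.

    Assumed definition of "% of L1"; the page itself does not spell it out.
    """
    l1_norm = sum(abs(w) for w in weights)
    if l1_norm == 0:
        raise ValueError("weight vector has zero L1 norm")
    return 100.0 * abs(weights[index]) / l1_norm


# Toy vector (hypothetical values, not the real feature weights).
toy_weights = [0.5, -0.25, 0.25, 0.0]
share_first = l1_share(toy_weights, 0)   # largest entry's share of the L1 norm
share_second = l1_share(toy_weights, 1)  # sign is ignored, only magnitude counts
```

Under this reading, a small percentage for the top-ranked neurons suggests that the weight is spread across many neurons rather than concentrated in a few.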
Correlated Neurons
Index   P. Corr.   Cos Sim.
599     +0.11      0.06
1919    +0.10      0.06
862     +0.08      0.03
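As a reading aid for the two columns above, here is a minimal sketch of the two metrics, assuming "P. Corr." means the Pearson correlation between two activation series and "Cos Sim." means the cosine similarity between two direction vectors. Which series and vectors the page actually compares is not stated, so the inputs below are hypothetical toy values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy inputs, not data from this page.
feature_acts = [0.0, 1.0, 2.0, 3.0]
neuron_acts = [1.0, 3.0, 5.0, 7.0]   # exact linear function of feature_acts
corr = pearson(feature_acts, neuron_acts)
cos = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # orthogonal directions
```

The two metrics answer different questions: correlation compares how activations co-vary across inputs, while cosine similarity compares directions in weight space, so they need not agree for the same neuron.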
Negative Logits
hcm        -0.99
palab      -0.98
siena      -0.94
thut       -0.93
vne        -0.93
santiago   -0.93
fatis      -0.92
nomine     -0.92
parati     -0.91
milano     -0.90
Positive Logits
ostavi          0.56
trust           0.53
viewing         0.51
watching        0.50
understand      0.48
expect          0.48
internetowa     0.47
ometrial        0.47
demand          0.46
understanding   0.46
Activations Density 0.402%
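The density figure above can be read as the share of sampled tokens on which the feature is active. The sketch below assumes that definition, with "active" meaning an activation strictly above a threshold of zero. The function name `activation_density`, the threshold parameter, and the toy activations are hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Percentage of entries in `activations` strictly above `threshold`.

    Assumed definition of activation density; the page does not define it.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical per-token activations: one active token out of four.
toy_activations = [0.0, 0.0, 0.0, 1.5]
density = activation_density(toy_activations)
```

Under this reading, a density well below 1% describes a sparse feature that fires on only a small fraction of tokens.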