INDEX
Explanations
expressions of defiance or assertiveness
Neuron Alignment
Index   Value   % of L₁
1919    +0.13   0.4%
381     +0.11   0.3%
1510    +0.10   0.3%
Correlated Neurons
Index   P. Corr.   Cos Sim.
1919    +0.13      0.11
1415    +0.11      0.06
805     +0.10      0.07
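The two columns above can be illustrated with a minimal sketch. It assumes that "P. Corr." is the Pearson correlation between two activation series over the same tokens and that "Cos Sim." is the cosine similarity between two weight vectors; the dashboard does not state which vectors it compares. The function names and the toy inputs are hypothetical.

```python
import math

def pearson_corr(x, y):
    """Pearson correlation between two equal-length sequences of activations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def cosine_sim(u, v):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy data (made up for illustration, not taken from the dashboard).
feature_acts = [0.0, 0.2, 0.0, 0.9, 0.1]
neuron_acts = [0.1, 0.3, 0.0, 0.7, 0.0]
print(pearson_corr(feature_acts, neuron_acts))
print(cosine_sim([1.0, 0.0, 2.0], [0.5, 1.0, 1.0]))
```

Both measures range from -1 to 1. Pearson correlation subtracts the mean first, so it can differ noticeably from cosine similarity when the inputs are mostly zero, as sparse activations are.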
Negative Logits
magis     -1.47
fatis     -1.43
hcm       -1.35
alip      -1.35
territo   -1.34
susun     -1.32
paff      -1.30
umo       -1.30
levis     -1.30
aen       -1.28
Positive Logits
never     0.74
don       0.73
am        0.72
cannot    0.69
want      0.69
prefer    0.66
wanted    0.65
know      0.65
didn      0.65
hate      0.64
Activations Density 0.328%
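A density figure like the one above can be read as the fraction of sampled tokens on which the feature fires. The sketch below assumes that definition, with "fires" meaning an activation above a threshold of zero; the dashboard does not state the exact rule, and `activation_density` is a hypothetical helper name.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation exceeds `threshold`.

    Assumed definition: the dashboard does not document how it counts
    a token as active.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)

# Toy data (made up): the feature fires on 1 of 8 tokens.
acts = [0.0, 0.0, 0.0, 1.3, 0.0, 0.0, 0.0, 0.0]
print(f"{activation_density(acts):.3%}")
```

Under this reading, a density of 0.328% means the feature is active on roughly 3 of every 1,000 tokens.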