Explanations
words related to emotions and reactions, especially negative emotions like disgust, anger, and upset
Neuron Alignment
Index    Value    % of L₁
1056     +0.12    0.3%
1013     +0.12    0.3%
1141     +0.09    0.2%
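The "% of L₁" column can be read as each neuron's share of the feature vector's L₁ norm. The dashboard does not state how the column is computed, so the sketch below is an assumption: it takes the absolute weight at each neuron index and divides it by the sum of absolute weights. The names `l1_share`, `top_aligned`, and the toy vector are hypothetical.

```python
def l1_share(weights):
    """Return each weight's absolute value as a fraction of the vector's L1 norm."""
    l1 = sum(abs(w) for w in weights)
    if l1 == 0:
        return [0.0 for _ in weights]
    return [abs(w) / l1 for w in weights]


def top_aligned(weights, k=3):
    """Return (index, value, share of L1) for the k largest-magnitude weights."""
    shares = l1_share(weights)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)[:k]
    return [(i, weights[i], shares[i]) for i in order]


# Hypothetical toy vector, not data from this dashboard.
toy = [0.5, -0.25, 0.125, 0.125]
for index, value, share in top_aligned(toy, k=2):
    print(f"{index}\t{value:+.2f}\t{share:.1%}")
```

Under this reading, a value of +0.12 at 0.3% of L₁ would imply a total L₁ norm of roughly 40, which is also consistent with the +0.09 row rounding to 0.2%.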
Correlated Neurons
Index    Pearson Corr.    Cosine Sim.
1056     +0.12            0.07
247      +0.12            0.06
1490     +0.09            0.06
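For readers unfamiliar with the two metrics in this table, the following is a minimal sketch of Pearson correlation and cosine similarity using the standard definitions. Which vectors the dashboard actually compares (for example, activations over a token sample versus weight vectors) is not stated here, so the inputs below are hypothetical placeholders.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical inputs: activations of a feature and a neuron on the same tokens.
feature_acts = [0.0, 0.0, 1.5, 0.0, 2.0]
neuron_acts = [0.1, -0.2, 0.9, 0.0, 1.1]
print(f"Pearson: {pearson(feature_acts, neuron_acts):+.2f}")
print(f"Cosine:  {cosine(feature_acts, neuron_acts):.2f}")
```

Pearson correlation subtracts the means before comparing, while cosine similarity does not, which is one reason the two columns can differ for the same neuron.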
Negative Logits
Token     Logit
meras     -0.88
utop      -0.85
makro     -0.83
kram      -0.77
elek      -0.77
hunde     -0.71
paus      -0.71
ortop     -0.70
sement    -0.70
palet     -0.70
Positive Logits
Token         Logit
Plotting      0.64
about         0.62
by            0.53
INPUTS        0.53
Iterate       0.51
jątk          0.51
Initialise    0.49
ness          0.49
because       0.48
Parsing       0.47
Activations Density 0.228%
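Activation density is commonly understood as the fraction of sampled tokens on which the feature fires. The sketch below assumes that reading, with "fires" taken to mean an activation strictly above a threshold of zero. The function name, the threshold, and the toy activations are all hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation is strictly above the threshold."""
    if not activations:
        return 0.0
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical sample: the feature fires on 3 of 1000 tokens.
toy_acts = [0.0] * 997 + [1.2, 0.4, 3.1]
print(f"Activations Density {activation_density(toy_acts):.3%}")
```

On this reading, a density of 0.228% would mean the feature is active on roughly 2 to 3 tokens per thousand in the sampled text.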