INDEX
Explanations
expressions of disbelief or surprise
Neuron Alignment
Index   Value   % of L₁
1265    +0.11   0.4%
1741    +0.10   0.3%
381     +0.10   0.3%
Correlated Neurons
Index   P. Corr.   Cos Sim.
353     +0.11      0.02
689     +0.10      0.02
521     +0.10      0.02
Negative Logits
Hahahahaha   -0.68
Hahahaha     -0.66
ulipas       -0.61
viedo        -0.61
meras        -0.60
ഊ            -0.60
€/           -0.59
girasol      -0.55
naran        -0.55
})->         -0.55
Positive Logits
Oh        0.91
oh        0.86
Oh        0.84
prouve    0.81
ferait    0.77
scrat     0.74
OH        0.72
pooh      0.69
défend    0.67
reconno   0.67
Activations Density: 0.032%