INDEX
Explanations
phrases indicating support, preparedness, and liking something
Neuron Alignment
Index   Value   % of L₁
198     +0.14   0.4%
1314    +0.09   0.3%
678     +0.07   0.2%
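The "% of L₁" column can be read as each entry's share of the L₁ norm of the vector it comes from. The page does not define it, so this reading is an assumption, although the three listed rows are consistent with it up to rounding. A minimal sketch under that assumption follows; the function name `l1_share` and the toy vector are hypothetical and are not taken from the dashboard.

```python
import numpy as np

def l1_share(weights, top_k=3):
    """Return (index, value, share of L1 norm) for the top_k largest-magnitude entries.

    Assumption: "% of L1" means |value| divided by the sum of absolute values.
    """
    weights = np.asarray(weights, dtype=float)
    l1_norm = np.abs(weights).sum()
    # Sort indices by descending absolute value and keep the first top_k.
    order = np.argsort(-np.abs(weights))[:top_k]
    return [(int(i), float(weights[i]), float(abs(weights[i]) / l1_norm)) for i in order]

# Toy vector for illustration only (not data from this page).
toy = np.array([0.5, -0.25, 0.125, 0.125])
for index, value, share in l1_share(toy, top_k=2):
    print(f"{index}\t{value:+.2f}\t{share:.1%}")
```

A small share for every listed entry, as in the table above, would mean the vector's mass is spread over many neurons and not concentrated in a few.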
Correlated Neurons
Index   P. Corr.   Cos Sim.
198     +0.14      0.06
92      +0.09      0.05
100     +0.07      0.04
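"P. Corr." and "Cos Sim." presumably stand for Pearson correlation and cosine similarity. The two differ only in that Pearson correlation centers each vector on its mean before comparing directions. The sketch below uses the standard definitions; which vectors the dashboard actually compares is not stated here, so the inputs are hypothetical.

```python
import numpy as np

def cosine_sim(x, y):
    """Cosine of the angle between x and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def pearson_corr(x, y):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return cosine_sim(x - x.mean(), y - y.mean())

# Hypothetical vectors for illustration only (not data from this page).
a = [1.0, 2.0, 3.0]
b = [3.0, 2.0, 1.0]
print(f"P. Corr. {pearson_corr(a, b):+.2f}  Cos Sim. {cosine_sim(a, b):.2f}")
```

Because of the centering step, the two measures can disagree in both size and sign, which is why the table reports them in separate columns.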
Negative Logits
unspeak    -0.90
unwarran   -0.89
disagre    -0.89
reluct     -0.86
impra      -0.86
shenan     -0.84
increa     -0.83
pamph      -0.83
affor      -0.81
inconce    -0.81
Positive Logits
things            0.60
tagPool           0.58
grans             0.57
masaj             0.57
disambiguazione   0.56
homonymie         0.54
anything          0.53
Roskov            0.53
excur             0.52
ностран           0.51
Activations Density 0.762%