INDEX
Explanations
phrases indicating uncertainty or personal opinion
Neuron Alignment
Index   Value   % of L₁
1026    +0.11   0.3%
690     +0.09   0.3%
1310    +0.08   0.3%
Correlated Neurons
Index   P. Corr.   Cos Sim.
1026    +0.11      0.04
869     +0.09      0.04
1691    +0.08      0.03
Negative Logits
poliuret      -0.90
porcelaine    -0.86
zó            -0.83
velours       -0.74
republi       -0.73
Præ           -0.72
canne         -0.71
vermel        -0.71
majest        -0.70
talle         -0.70
Positive Logits
there           0.62
seems           0.59
ExecuteAsync    0.58
animity         0.57
that            0.54
SOMEONE         0.51
ürze            0.51
trône           0.50
uitgenodigd     0.50
suckers         0.50
Activations Density: 0.086%