INDEX
Explanations
phrases indicating being out of alignment or disagreement with something
Neuron Alignment
Index    Value    % of L₁
674      +0.14    0.4%
897      +0.12    0.4%
1557     +0.09    0.3%
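The "% of L₁" column is not defined on this page. A plausible reading, stated here as an assumption, is that it gives each neuron's absolute weight as a share of the L₁ norm (the sum of absolute values) of the feature's full weight vector. The sketch uses that reading; the function name `l1_share` and the example weights are hypothetical and are not taken from this page.

```python
def l1_share(weights, index):
    """Return |weights[index]| as a fraction of the vector's L1 norm.

    Assumption: this matches what the "% of L1" column reports.
    """
    l1_norm = sum(abs(w) for w in weights)
    if l1_norm == 0:
        raise ValueError("L1 norm is zero; the share is undefined")
    return abs(weights[index]) / l1_norm


# Hypothetical 4-neuron weight vector (illustrative values only).
weights = [0.5, -0.25, 0.125, 0.125]
share = l1_share(weights, 0)
print(f"{share:.1%}")
```

Under this reading, small percentages such as those in the table would indicate that the feature's weight is spread across many neurons rather than concentrated in a few.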
Correlated Neurons
Index    P. Corr.    Cos Sim.
897      +0.14       0.03
1557     +0.12       0.03
1356     +0.09       0.03
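Assuming "P. Corr." abbreviates Pearson correlation and "Cos Sim." cosine similarity, the two columns measure different things even when computed on the same pair of vectors: Pearson correlation subtracts each vector's mean first, while cosine similarity does not. The sketch below shows both on hypothetical data. The page does not say which vectors (activations or weights) each column is computed from, so the inputs here are purely illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pearson_correlation(a, b):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centered_a = [x - mean_a for x in a]
    centered_b = [y - mean_b for y in b]
    return cosine_similarity(centered_a, centered_b)


# Hypothetical vectors (not from this page): b decreases as a increases,
# yet every entry is positive, so the uncentered cosine stays positive.
a = [1.0, 2.0, 3.0]
b = [3.0, 2.0, 1.0]
print(pearson_correlation(a, b), cosine_similarity(a, b))
```

For these inputs the Pearson correlation is negative while the cosine similarity is positive, which is why the two columns need not track each other.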
Negative Logits
Token        Logit
BIBLIO       -0.46
dismayed     -0.45
preferring   -0.45
end          -0.44
frowning     -0.44
wój          -0.43
'.';         -0.43
".";         -0.43
ători        -0.43
undoubted    -0.43
Positive Logits
Token        Logit
traktor      1.05
sopr         1.02
abnorm       1.01
kask         0.99
overla       0.99
stik         0.99
nutr         0.99
ordina       0.99
lapto        0.97
tramont      0.97
Activations Density 0.072%