INDEX
Explanations
phrases indicating a firm opinion or viewpoint
Neuron Alignment
Index    Value    % of L₁
1842     +0.17    0.5%
468      +0.14    0.4%
394      +0.11    0.3%
Correlated Neurons
Index    P. Corr.    Cos Sim.
599      +0.17       0.07
468      +0.14       0.05
1958     +0.11       0.05
Negative Logits
conspic        -1.01
uncin          -0.94
<bos>          -0.94
desir          -0.92
dsg            -0.89
antem          -0.89
effe           -0.88
coar           -0.88
inext          -0.87
lamborghini    -0.86
Positive Logits
…             0.52
={`/          0.50
...           0.50
fucked        0.48
deeply        0.46
ínguez        0.46
[             0.46
Tôi           0.45
utilizarse    0.45
my            0.45
Activations Density: 0.631%