Explanations
phrases related to causes and effects
Neuron Alignment
Index    Value    % of L₁
1177     +0.16    0.5%
674      +0.15    0.5%
1253     +0.14    0.4%
Correlated Neurons
Index    P. Corr.    Cos Sim.
468      +0.16       0.06
392      +0.15       0.04
1801     +0.14       0.04
Negative Logits
impra       -1.11
reluct      -1.08
ineffec     -1.06
resear      -1.05
unspeak     -1.05
strick      -1.03
disagre     -1.02
desir       -1.00
unwarran    -1.00
apprehen    -1.00
Positive Logits
<bos>             0.87
Autoritní         0.74
Tikang            0.67
Vidite            0.65
Италијани         0.63
<>",              0.63
>=",              0.62
enrique           0.61
IndentedString    0.61
<",               0.61
Activations Density 0.714%