Explanations
terms related to philosophical or academic discussions around theories and concepts
Neuron Alignment
Index   Value   % of L₁
872     +0.17   0.5%
198     +0.09   0.3%
1531    +0.09   0.3%
Correlated Neurons
Index   P. Corr.   Cos Sim.
1953    +0.17      0.04
872     +0.09      0.07
1531    +0.09      0.04
Negative Logits
intersper    -1.15
reluct       -1.04
unspeak      -1.04
shenan       -0.96
intrigu      -0.95
sophistic    -0.95
disagre      -0.93
apprehen     -0.92
indestru     -0.92
philanth     -0.90
Positive Logits
assume            0.62
overlook          0.61
insuffisamment    0.61
ignore            0.61
overlooks         0.60
overlooking       0.59
focus             0.57
ignores           0.55
assumptions       0.55
enderror          0.54
Activations Density: 0.561%