INDEX
Explanations
Phrases expressing philosophical and introspective reflection, particularly on human value and the perception of truth
Neuron Alignment
Index   Value   % of L₁
872     +0.12   0.3%
764     +0.09   0.3%
394     +0.09   0.2%
Correlated Neurons
Index   P. Corr.   Cos Sim.
872     +0.12      0.06
1937    +0.09      0.05
1060    +0.09      0.05
Negative Logits
reluct   -1.03
Juf      -1.00
aen      -0.98
inev     -0.94
fta      -0.92
maneu    -0.92
depic    -0.90
hcm      -0.86
thut     -0.86
fte      -0.85
Positive Logits
depends     0.61
afla        0.59
dependent   0.56
tasche      0.56
gambe       0.56
లాలు        0.54
achieved    0.54
weetened    0.53
depend      0.53
defined     0.52
Activations Density: 0.410%