Explanations
phrases related to judgment, evaluation, and worthiness
Neuron Alignment
Index   Value   % of L₁
1842    +0.14   0.4%
478     +0.09   0.3%
1042    +0.09   0.3%
Correlated Neurons
Index   P. Corr.   Cos Sim.
392     +0.14      0.03
1617    +0.09      0.04
1553    +0.09      0.06
Negative Logits
fortn              -0.75
impractica         -0.73
wherea             -0.72
inev               -0.70
maneu              -0.69
cuck               -0.68
ACKNOWLEDGMENTS    -0.68
indestru           -0.67
disreg             -0.66
reconnaît          -0.66
Positive Logits
suitability        0.68
suitable           0.66
worthy             0.66
deserving          0.65
Suitable           0.62
Suitable           0.61
IsContent          0.60
<bos>              0.60
vencia             0.59
fit                0.57
Activations Density: 0.559%