Explanations
alignment with goals or values
Negative Logits
d       0.85
}       0.66
s       0.62
h       0.61
of      0.61
l       0.57
<h2>    0.53
()]     0.53
g       0.53
ية      0.52
Positive Logits
aligned         0.86
aligns          0.86
aligning        0.82
alignment       0.80
Alignment       0.75
straight        0.74
straightened    0.74
STRAIGHT        0.74
Straight        0.73
Straight        0.72
Activations Density: 0.022%