Explanations
phrases addressing the user
Negative Logits
[    0.61
(    0.60
↵↵    0.59
Со    0.59
↵    0.58
</    0.58
(token not rendered)    0.57
Ver    0.56
)    0.56
Пол    0.56
Positive Logits
yourselves    1.84
yourself    1.83
Yourself    1.83
yourself    1.73
me    1.72
نفسك    1.48
your    1.47
してください    1.45
jezelf    1.44
해주세요    1.38
Activations Density: 0.451%