Explanations
expressions of personal identity and self-reflection
Negative Logits
itself      -0.21
reck        -0.19
st          -0.18
ly          -0.18
(s          -0.17
lx          -0.16
less        -0.16
lv          -0.16
liness      -0.16
themselves  -0.16
Positive Logits
’m      0.39
'm      0.34
’ve     0.32
am      0.32
myself  0.32
've     0.31
буду    0.27
’ll     0.25
/we     0.24
'll     0.23
Activations Density 0.455%