INDEX
Explanations
expressions of personality traits and self-descriptions
Negative Logits
ulet    -0.19
anz     -0.15
Vien    -0.15
hea     -0.15
ALIGN   -0.15
uzey    -0.14
éd      -0.14
oug     -0.13
kaar    -0.13
poke    -0.13
Positive Logits
easily       0.20
outgoing     0.20
prone        0.18
sensitive    0.18
intro        0.18
independent  0.17
always       0.17
liable       0.16
boro         0.16
liability    0.16
Activations Density: 0.416%