Explanations
pronouns related to individuals or groups
Negative Logits
Token     Logit
itself    -0.17
e         -0.17
taire     -0.15
onaut     -0.15
ayne      -0.14
ï         -0.14
pom       -0.14
purple    -0.14
ibel      -0.14
isma      -0.14
Positive Logits
Token       Logit
/us         0.38
/her        0.35
self        0.26
zelf        0.25
SELF        0.25
/th         0.23
iner        0.21
atically    0.20
-même       0.20
chy         0.19
Activations Density: 0.155%