Explanations
possessive language indicating ownership or belonging
Negative Logits
oth      -0.18
odge     -0.17
ig       -0.16
ech      -0.16
wise     -0.16
sex      -0.15
PELL     -0.15
ndon     -0.14
imore    -0.14
ritz     -0.14
Positive Logits
elves    0.24
own      0.21
zelf     0.20
/her     0.19
elay     0.18
gii      0.17
chaft    0.17
elon     0.16
æ¾       0.16
itable   0.16
Activations Density 0.013%