INDEX
Explanations
references to the pronoun "her" and related possessive forms
Negative Logits
eous       -0.17
leine      -0.15
Ĥæķ°       -0.15
hiba       -0.15
sse        -0.15
Washer     -0.14
abox       -0.14
arakter    -0.14
wy         -0.14
ful        -0.14
Positive Logits
editary    0.27
/her       0.25
esy        0.20
/she       0.17
ding       0.16
own        0.16
/us        0.15
itable     0.15
ewith      0.15
din        0.15
Activations Density: 0.262%