INDEX
Explanations
references to a specific female character
Negative Logits
aroo          -0.20
heads         -0.15
uzzi          -0.15
itself        -0.15
ائر           -0.15
ागत           -0.14
yourselves    -0.14
pedig         -0.14
jac           -0.13
issance       -0.13
POSITIVE LOGITS
own           0.37
/us           0.35
editary       0.34
/her          0.30
esy           0.29
ding          0.26
etical        0.24
mits          0.24
etics         0.23
SELF          0.23
Activations Density 0.063%