INDEX
Explanations
repeated references to a singular female subject
Negative Logits
Token     Logit
resse     -0.18
eum       -0.16
leneck    -0.15
ouis      -0.15
lass      -0.15
gger      -0.15
cone      -0.15
osite     -0.15
ivia      -0.15
cliffe    -0.15
Positive Logits
Token     Logit
/us       0.28
/her      0.23
editary   0.23
ding      0.22
zelf      0.19
/th       0.18
ewith     0.18
ded       0.17
etical    0.17
esy       0.17
Activations Density: 0.121%