Explanations
references to male characters and their actions or states
Negative Logits
.dk        -0.16
obl        -0.15
ерп        -0.15
oker       -0.15
BO         -0.15
оба        -0.15
exion      -0.14
ackers     -0.14
ful        -0.14
logg       -0.14
Positive Logits
/she        0.21
/her        0.18
idi         0.17
rip         0.17
kul         0.17
ady         0.16
idelberg    0.16
Kah         0.15
[           0.15
Majesty     0.15
Activation Density: 0.361%
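The dashboard does not state how these numbers are produced. The sketch below is a minimal illustration, not the tool's actual code. It assumes that the logit lists come from projecting the feature's decoder direction through the model's unembedding matrix and ranking the resulting per-token values. It also assumes that activation density is the percentage of tokens on which the feature's activation is above zero. All function and parameter names are hypothetical.

```python
# Hypothetical sketch of dashboard-style summaries for a single feature.
# Assumption: logit effect per token = decoder_row . unembed[:, token].
# Assumption: activation density = % of tokens with activation > 0.

def logit_effects(decoder_row, unembed, vocab, k=10):
    """Return the k most negative and k most positive (token, logit) pairs.

    decoder_row: list of d_model floats (the feature's decoder direction)
    unembed:     d_model rows, each a list of len(vocab) floats
    vocab:       list of token strings
    """
    logits = [
        sum(d * unembed[j][t] for j, d in enumerate(decoder_row))
        for t in range(len(vocab))
    ]
    # Token indices sorted from most negative to most positive logit.
    ranked = sorted(range(len(vocab)), key=lambda t: logits[t])
    negative = [(vocab[t], logits[t]) for t in ranked[:k]]
    # Largest logits first for the positive list.
    positive = [(vocab[t], logits[t]) for t in reversed(ranked[-k:])]
    return negative, positive


def activation_density(activations):
    """Percentage of tokens on which the feature's activation is > 0."""
    firing = sum(1 for a in activations if a > 0)
    return 100.0 * firing / len(activations)


if __name__ == "__main__":
    # Toy example: 2-dimensional residual stream, 4-token vocabulary.
    vocab = ["a", "b", "c", "d"]
    unembed = [[0.2, -0.1, 0.5, -0.3], [0.0, 0.0, 0.0, 0.0]]
    neg, pos = logit_effects([1.0, 0.0], unembed, vocab, k=2)
    print("negative:", neg)
    print("positive:", pos)
    print("density %:", activation_density([0, 0.5, 0, 0, 1.2, 0, 0, 0, 0, 0]))
```

In the toy run, two of the ten tokens have a positive activation, so the density is 20%. A density of 0.361% would mean the feature fires on roughly 3.6 tokens per thousand.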