INDEX
Explanations
references to a single male individual at various points in the text
Negative Logits
è¥       -0.15
imus     -0.15
aron     -0.14
eam      -0.14
taire    -0.14
angelo   -0.14
velle    -0.14
.sz      -0.14
ago      -0.14
ways     -0.14
Positive Logits
/her       0.19
inerary    0.18
/th        0.18
/she       0.16
/we        0.16
kek        0.15
atically   0.15
å§         0.14
iner       0.14
ali        0.14
Activations Density 0.051%