Explanations
references to mothers or maternal figures
Negative Logits
wright     -0.16
egin       -0.14
owy        -0.14
-linear    -0.14
rd         -0.14
iders      -0.14
ird        -0.14
ãĥĥãĥĪ     -0.14
Mocks      -0.13
strup      -0.13
Positive Logits
hood       0.20
-child     0.18
gos        0.17
itespace   0.17
gom        0.15
aight      0.15
eros       0.15
ÑĤаж       0.15
REN        0.15
SHIP       0.15
Activations Density 0.041%