Explanations
names, particularly those that are feminine or often associated with women
Negative Logits
juan        -0.18
éħ          -0.15
obus        -0.14
Lifecycle   -0.14
indi        -0.14
erah        -0.14
ÑĮми        -0.14
yb          -0.14
iets        -0.14
Mid         -0.13
Positive Logits
Ñĩе          0.16
ẫ            0.16
ãĥ³          0.16
amo          0.15
les          0.14
à¥Īन         0.14
strate       0.14
_multiplier  0.14
awner        0.14
ãĥ³ãĥ        0.14
Activations Density: 0.048%