INDEX
Explanations
specific patterns or suffixes in words related to gender, social roles, or categories
Negative Logits
SCIP             -0.17
Ashe             -0.16
igit             -0.16
PST              -0.16
ASH              -0.16
Sting            -0.15
aterno           -0.15
Bast             -0.14
ISK              -0.14
.SimpleButton    -0.14
Positive Logits
SS               0.63
ss               0.61
ss               0.60
SS               0.59
_ss              0.54
.ss              0.51
есс              0.51
(ss              0.51
ess              0.50
:ss              0.49
Activations Density 0.133%