Explanations
references to societal norms regarding gender roles, particularly in relation to appearance and behavior
Negative Logits
atan          -0.16
.construct    -0.15
Hlav          -0.15
sir           -0.14
iegel         -0.14
611           -0.14
225           -0.14
洲            -0.13
arat          -0.13
511           -0.13
Positive Logits
superv        0.19
rought        0.16
ellij         0.16
loth          0.15
simp          0.15
célib         0.15
tokens        0.15
Straw         0.15
simples       0.14
disposed      0.14
Activations Density: 0.049%