INDEX
Explanations
references to social or cultural norms regarding women's behavior and rights
Negative Logits
autorytatywna      -0.86
חיצוניים           -0.76
extAlignment       -0.76
ConstraintMaker    -0.76
purpoſe            -0.75
beginnetje         -0.75
Portale            -0.75
defaultstate       -0.74
myſelf             -0.74
isateur            -0.74
Positive Logits
BarStyle           0.43
Einf               0.42
S                  0.39
W                  0.39
ro                 0.39
top                0.38
DebuggerNonUser    0.38
saluto             0.38
How                0.37
UrlEncoded         0.37
Activations Density 1.903%