Explanations
phrases indicating the presence or involvement of women in various contexts
Negative Logits
purpoſe   -0.94
pleaſure  -0.89
juſ       -0.81
ſever     -0.80
myſelf    -0.79
feroit    -0.77
uſe       -0.77
ſtate     -0.77
uſed      -0.75
ſta       -0.74
Positive Logits
s         1.37
s         0.69
{~        0.60
own       0.59
𝑠         0.55
ils       0.54
Ys        0.52
etts      0.52
mens      0.52
ds        0.51
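Dashboards of this kind typically produce the negative and positive logit lists by projecting the feature's decoder direction through the model's unembedding matrix and keeping the lowest and highest entries. The sketch below shows that computation under stated assumptions: NumPy arrays, a decoder row of shape (d_model,), and an unembedding of shape (d_model, d_vocab). The function name `top_logit_effects` is hypothetical, and this is not necessarily how this page's values were computed (for example, it applies no normalization).

```python
import numpy as np


def top_logit_effects(w_dec_row, w_unembed, vocab, k=10):
    """Hypothetical helper: direct logit effect of one feature direction.

    Assumed shapes: w_dec_row is (d_model,), w_unembed is (d_model, d_vocab),
    and vocab is a list of d_vocab token strings.
    Returns (top_positive, top_negative) as lists of (token, value) pairs.
    """
    logits = w_dec_row @ w_unembed   # one logit contribution per vocab token
    order = np.argsort(logits)       # token indices, ascending by value
    negative = [(vocab[i], float(logits[i])) for i in order[:k]]
    positive = [(vocab[i], float(logits[i])) for i in order[::-1][:k]]
    return positive, negative
```

For instance, with a two-dimensional model and a three-token vocabulary, the token whose unembedding column aligns best with the decoder direction heads the positive list.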
Activation Density: 0.248%
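Activation density is usually the fraction of sampled tokens on which the feature's activation is above zero. A density of 0.248% therefore means roughly 2.5 activating tokens per 1,000. A minimal sketch, assuming a flat array of per-token activations and a firing threshold of 0.0; the function name `activation_density` and the threshold are illustrative assumptions, not the dashboard's documented method:

```python
import numpy as np


def activation_density(activations, threshold=0.0):
    """Hypothetical helper: fraction of tokens whose activation exceeds threshold."""
    acts = np.asarray(activations, dtype=float)
    return float(np.mean(acts > threshold))
```

Multiplying the result by 100 gives the percentage form shown above.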