INDEX
Explanations
pronouns and their various forms, particularly in the context of male subjects
Negative Logits
vyk            -0.17
cido           -0.15
ray            -0.15
TestCategory   -0.15
sm             -0.15
oct            -0.15
504            -0.15
Metro          -0.15
crest          -0.14
vents          -0.14
Positive Logits
ster           0.24
inner          0.24
wor            0.23
stm            0.21
inn            0.21
öff            0.20
kan            0.20
he             0.19
hi             0.18
mut            0.18
Activations Density 0.006%