INDEX
Explanations
references to men and male-related topics
Negative Logits
Sad      -0.19
Sad      -0.17
ters     -0.16
eren     -0.15
ahn      -0.15
atik     -0.15
aters    -0.15
ernote   -0.14
uil      -0.14
tee      -0.14
Positive Logits
opause   0.23
cken     0.21
endez    0.20
orca     0.20
ubar     0.20
kes      0.20
iscal    0.19
едж      0.19
isci     0.19
acing    0.19
Activations Density 0.016%