Explanations
references to men and masculinity in various contexts
Negative Logits
dale                    -0.17
allon                   -0.17
AuthenticationService   -0.15
ture                    -0.15
dge                     -0.15
enticate                -0.15
gaard                   -0.15
DAG                     -0.15
er                      -0.14
ssel                    -0.14

Positive Logits
opause                   0.28
aced                     0.25
folk                     0.23
volent                   0.22
ninger                   0.21
ubar                     0.19
-only                    0.19
aces                     0.17
insky                    0.16
orca                     0.16
Activations Density 0.040%