Explanations
references to social roles and identities
Negative Logits
Token     Logit
zers      -0.17
stuff     -0.16
isko      -0.16
ions      -0.16
Blonde    -0.15
ods       -0.14
Jou       -0.14
awe       -0.14
ilden     -0.14
ald       -0.14
Positive Logits
Token     Logit
unto      0.23
capable   0.19
who       0.17
able      0.16
extra     0.16
/Area     0.14
реб       0.14
reb       0.14
बनन       0.14
ician     0.14
Activations Density: 0.226%