Explanations
references to individuals' identities and how they are perceived by others
Negative Logits
rary        -0.16
anas        -0.16
wording     -0.15
motive      -0.14
ande        -0.14
opy         -0.14
illon       -0.14
典          -0.13
enu         -0.13
Zy          -0.13
Positive Logits
nick        0.24
nick        0.23
nickname    0.21
Nick        0.20
nickname    0.19
.nickname   0.18
shortened   0.18
shorter     0.17
rever       0.17
informal    0.17
Activations Density: 0.078%