INDEX
Explanations
references to queer identity and related terms
Negative Logits
nya     -0.18
shape   -0.18
ship    -0.16
orne    -0.16
son     -0.16
sw      -0.16
ly      -0.16
sun     -0.16
s       -0.15
li      -0.15
Positive Logits
uing    0.29
ued     0.28
bec     0.28
ues     0.28
ens     0.21
erness  0.21
uetype  0.20
UES     0.19
estion  0.19
ENS     0.17
Activations Density 0.008%