INDEX
Explanations
References to body positivity and discussions of societal beauty standards
Negative Logits
mux          -0.15
bere         -0.15
θη           -0.14
dle          -0.13
sentimental  -0.13
óm           -0.13
ux           -0.13
лоÑĩ         -0.13
sodom        -0.13
Geoff        -0.13
Positive Logits
body         0.35
beauty       0.33
Body         0.29
Beauty       0.28
BODY         0.26
Beauty       0.26
Body         0.25
/body        0.24
-body        0.24
bodies       0.24
Activations Density: 0.115%
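
A density figure like this is usually read as the share of tokens on which the feature fires. The sketch below shows one way to compute such a value. It assumes density means the fraction of tokens with activation above zero, which the page itself does not state; the function name, the threshold, and the sample data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens on which the feature
    is active. The source does not define the term.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical per-token activations: one active token out of 2000.
sample = [0.0] * 1999 + [2.5]
density = activation_density(sample)

# Format as a percentage with three decimals, matching the style above.
print(f"{density:.3%}")
```

Under this reading, 0.115% would mean the feature is active on roughly 1 token in every 870.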