Explanations
words related to the human body
terms related to bodily structures and waste
Negative Logits
Token     Logit
y         -0.98
xual      -0.96
hips      -0.90
es        -0.90
ournal    -0.81
¶æ        -0.80
e         -0.79
hip       -0.76
CI        -0.75
yne       -0.75
Positive Logits
Token     Logit
brance    0.77
Ship      0.75
awar      0.74
urbed     0.72
ressed    0.70
butt      0.70
iless     0.70
urb       0.70
ger       0.70
ancing    0.69
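Each list above is a ranking of tokens by the feature's logit effect: the most negative values in one list, the most positive in the other. The sketch below shows only that ranking step, assuming the per-token values are already available as a dictionary (on dashboards like this one they are typically the feature's decoder direction projected onto the unembedding, but that computation is not shown here). The helper name `top_logits` is hypothetical, and the sample values are copied from the lists above.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, float]


def top_logits(logit_effects: Dict[str, float], k: int) -> Tuple[List[Pair], List[Pair]]:
    """Hypothetical helper: return the k most negative and k most positive (token, value) pairs.

    Assumes `logit_effects` maps each token string to the logit value a dashboard reports.
    """
    # Sort ascending by value so the most negative tokens come first.
    ranked = sorted(logit_effects.items(), key=lambda item: item[1])
    negative = ranked[:k]        # smallest (most negative) values
    positive = ranked[::-1][:k]  # largest (most positive) values
    return negative, positive


# A few values taken from the Negative Logits and Positive Logits lists above.
effects = {"y": -0.98, "xual": -0.96, "hips": -0.90, "brance": 0.77, "Ship": 0.75, "awar": 0.74}
neg, pos = top_logits(effects, 2)
print(neg)  # [('y', -0.98), ('xual', -0.96)]
print(pos)  # [('brance', 0.77), ('Ship', 0.75)]
```

If the vocabulary holds fewer than 2k tokens, the two lists can overlap; a full vocabulary does not have that issue in practice.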
Activation Density: 0.121%
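Activation density is commonly read as the percentage of token positions on which the feature fires. The sketch below assumes "fires" means an activation strictly above a threshold of 0; the dashboard's exact threshold and sampling set are not stated here. The function `activation_density` and the synthetic activations are hypothetical, chosen only so the result matches the 0.121% figure.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Hypothetical helper: percentage of token positions whose activation exceeds `threshold`."""
    if len(activations) == 0:
        raise ValueError("need at least one activation")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Synthetic example (not real data): 121 active positions out of 100,000 tokens.
acts = [0.0] * 100_000
for i in range(121):
    acts[i * 800] = 1.5  # distinct indices 0, 800, ..., 96000

print(activation_density(acts))  # 0.121
```

A density this low means the feature is inactive on the vast majority of tokens, which is the usual goal for sparse features.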