Explanations
references to the concept of 'girl' in various contexts
Negative Logits
Token        Logit
eer          -0.18
raman        -0.17
gentlemen    -0.17
gentleman    -0.16
gratuite     -0.15
eworthy      -0.15
lid          -0.15
340          -0.15
etus         -0.15
etr          -0.15
Positive Logits
Token        Logit
ie           0.33
hood         0.32
friends      0.31
friend       0.31
-next        0.28
scout        0.27
ies          0.26
/w           0.26
boss         0.26
Scout        0.25
Activations Density: 0.037%