INDEX
Explanations
expressions of love and affection
Negative Logits
uman      -0.16
ootball   -0.15
ization   -0.14
          -0.14
ols       -0.14
andon     -0.14
emiz      -0.14
ks        -0.14
ophile    -0.13
sexual    -0.13
Positive Logits
ably          0.19
eat           0.18
joy           0.17
fully         0.17
kind          0.17
edException   0.17
affair        0.16
_errno        0.15
full          0.15
leigh         0.14
Activations Density 0.080%