INDEX
Explanations
repeated expressions of preference or affection
Negative Logits
ista        -0.17
/by         -0.16
behalf      -0.16
ItemType    -0.16
idth        -0.15
ils         -0.15
ught        -0.14
uelles      -0.14
sel         -0.14
acco        -0.14
Positive Logits
/dis        0.21
/lo         0.21
able        0.20
-minded     0.18
ably        0.17
elihood     0.16
latter      0.15
Ike         0.15
WISE        0.15
to          0.15
Activations Density: 0.048%