INDEX
Explanations
expressions of love and affection
Negative Logits
umb     -0.16
iaz     -0.15
sexual  -0.15
виж     -0.15
ivism   -0.15
stroy   -0.15
sus     -0.15
una     -0.15
que     -0.15
idal    -0.14
Positive Logits
affair  0.21
joy     0.20
/lo     0.19
ably    0.18
lessly  0.18
ingly   0.17
Hate    0.17
/h      0.16
eat     0.16
able    0.16
Activations Density 0.080%