INDEX
Explanations
expressions of enjoyment and pleasure in various contexts
Negative Logits
older       -0.17
atte        -0.15
esar        -0.15
enjoying    -0.14
owing       -0.14
uras        -0.14
.rt         -0.14
ual         -0.14
arin        -0.14
ford        -0.13
Positive Logits
ably         0.36
ment         0.24
ments        0.24
able         0.23
themselves   0.21
yourselves   0.20
ables        0.19
/dis         0.19
erals        0.19
ourselves    0.19
Activations Density 0.043%