Explanations
instances of enjoyment or positive experiences
Negative Logits
сли         -0.17
enjoying    -0.16
drawing     -0.15
ching       -0.15
owing       -0.14
gere        -0.14
uras        -0.14
older       -0.14
odes        -0.14
ched        -0.14
Positive Logits
ably          0.36
able          0.23
ments         0.22
ment          0.21
ables         0.20
erals         0.20
/dis          0.20
themselves    0.17
freedoms      0.16
being         0.16
Activations Density: 0.041%