INDEX
Explanations
expressions of enjoyment or positive feelings
Negative Logits
nd        -0.19
ities     -0.18
ità       -0.17
ansom     -0.17
érique    -0.16
haps      -0.16
ITY       -0.16
nds       -0.16
hausen    -0.16
has       -0.16
Positive Logits
fully     0.42
ened      0.35
eous      0.34
ening     0.33
ful       0.32
mare      0.30
enment    0.29
ting      0.28
ning      0.27
fulness   0.26
Activations Density 0.012%