INDEX
    Explanations

    expressions of enjoyment or positive feelings

    NEGATIVE LOGITS
    nd        -0.19
    ities     -0.18
    ità       -0.17
    ansom     -0.17
    érique    -0.16
    haps      -0.16
    ITY       -0.16
    nds       -0.16
    hausen    -0.16
    has       -0.16
    POSITIVE LOGITS
    fully     0.42
    ened      0.35
    eous      0.34
    ening     0.33
    ful       0.32
    mare      0.30
    enment    0.29
    ting      0.28
    ning      0.27
    fulness   0.26
    Act Density 0.012%

    No Known Activations