INDEX
    Explanations

    phrases expressing fluctuations or changes, particularly those that indicate positive (ups) and negative (downs) experiences

    Negative Logits
    lessly      -0.17
    296         -0.15
    cken        -0.15
     deepest    -0.15
     Erk        -0.15
     Pant       -0.15
    ovice       -0.14
    opsis       -0.14
    ort         -0.14
     Grip       -0.14
    Positive Logits
    /down       0.44
    -down       0.25
    /up         0.23
    datable     0.20
    ilon        0.20
    scaling     0.19
    graded      0.19
    most        0.19
    ward        0.19
    ILON        0.18
    Act Density 0.049%

    No Known Activations