INDEX
    Explanations

    phrases indicating progress or improvement

    Negative Logits
    "iggers"      -0.16
    "urga"        -0.15
    " Rad"        -0.15
    " Ru"         -0.15
    " Ved"        -0.15
    " Chang"      -0.14
    " rad"        -0.14
    " conserv"    -0.14
    " Pale"       -0.14
    " pale"       -0.14
    Positive Logits
    "ightly"      0.15
    " å¢"         0.15
    "rels"        0.15
    " íļ"         0.15
    " XO"         0.15
    "hetto"       0.14
    " incre"      0.14
    " increment"  0.14
    "é̲"           0.14
    " thêm"       0.14
    Activation Density: 0.202%

    No Known Activations