INDEX
    Explanations

    phrases indicating support or affirmation

    Negative Logits
    Token       Logit
    " Rouge"    -0.16
    "anch"      -0.15
    "DM"        -0.15
    "eka"       -0.14
    "UG"        -0.14
    " conj"     -0.14
    "ATS"       -0.14
    " favor"    -0.14
    "ad"        -0.14
    "eness"     -0.14
    Positive Logits
    Token       Logit
    " backing"  0.32
    "Backing"   0.27
    "/back"     0.24
    " backed"   0.23
    " backs"    0.21
    "-backed"   0.21
    "(back"     0.20
    "=back"     0.20
    "haul"      0.19
    "aret"      0.18
    Activation Density: 0.015%

    No Known Activations