INDEX
    Explanations
    Negative Logits
    Token              Logit
    "userc"            -0.70
    " Supplemental"    -0.62
    "afety"            -0.62
    "âĹ"               -0.61
    " stub"            -0.61
    "©¶æ"              -0.59
    " NETWORK"         -0.59
    "hower"            -0.59
    "ocument"          -0.58
    " pretext"         -0.57
    Positive Logits
    Token              Logit
    "agos"             0.92
    "forth"            0.78
    "oha"              0.75
    "gow"              0.73
    "reath"            0.71
    "rant"             0.70
    "furt"             0.68
    "asks"             0.68
    "ted"              0.68
    "stad"             0.67
    Activation Density 0.089%

    No Known Activations