INDEX
    NEGATIVE LOGITS
    0.69
     them
    0.64
     **
    0.63
    0.61
    works
    0.61
     Working
    0.60
     Them
    0.59
    <sup>
    0.58
    0.56
    Subscribe
    0.56
    POSITIVE LOGITS
    <unused1855>  0.94
    <unused1724>  0.93
    <unused338>   0.92
    <unused1766>  0.92
    <unused321>   0.91
    <unused416>   0.90
    <unused311>   0.90
    <unused2066>  0.90
    arder         0.89
     sampah       0.88
    Act Density 0.040%

    No Known Activations