INDEX
    Explanations
    Negative Logits
    Token               Logit
    "いう"              -0.08
    "_roi"              -0.07
    "Kim"               -0.07
    "nombre"            -0.07
    " pixels"           -0.06
    "Td"                -0.06
    "ildren"            -0.06
    " authoritarian"    -0.06
    " rnn"              -0.06
    "Action"            -0.06
    Positive Logits
    Token               Logit
    "Manifest"          0.07
    " brainstorm"       0.07
    "_manifest"         0.06
    "müş"               0.06
    "ิป"                0.06
    " confirmed"        0.06
    "FORMAT"            0.06
    "(diff"             0.06
    " immensely"        0.06
    " Inform"           0.06
    Activation Density: 0.002%

    No Known Activations