INDEX
    Explanations

    Correctness/Rightness

    Negative Logits
    Token          Logit
    " Collapse"    -0.07
    "子供"          -0.07
    "низ"          -0.07
    " 참고"         -0.06
    " Hund"        -0.06
    "icerca"       -0.06
    "43"           -0.06
    "63"           -0.06
    "زام"          -0.06
    "ко"           -0.06
    Positive Logits
    Token             Logit
    " wf"             0.07
    " entitled"       0.07
    " wrong"          0.06
    " brilliantly"    0.06
    "pellier"         0.06
    ".w"              0.06
    "978"             0.06
    "roi"             0.06
    " introduce"      0.06
    "_w"              0.06
    Activation Density: 0.046%

    No Known Activations