    Negative Logits
    " interven"         -0.07
    "Implementation"    -0.07
    ".criteria"         -0.07
    " implants"         -0.07
    "Impl"              -0.07
    " implant"          -0.07
    "文字"              -0.07
    (token missing)     -0.07
    " Rost"             -0.06
    ".impl"             -0.06
    Positive Logits
    (whitespace)        0.09
    "ummar"             0.09
    "למיד"              0.08
    "weta"              0.08
    " სახ"              0.08
    " Episode"          0.08
    "ивание"            0.08
    " melon"            0.08
    " Biography"        0.08
    "خف"                0.08
    Activation Density: 0.052%

    No Known Activations