    NEGATIVE LOGITS
    K         0.70
              0.70
     name     0.69
    cat       0.68
    the       0.67
    pri       0.67
    men       0.67
    name      0.67
    print     0.67
    for       0.66
    POSITIVE LOGITS
     שלא      1.24
     Hanya    1.21
    缺乏        1.11
     没有       1.09
     ""`      1.08
    ँच        1.08
    gll       1.05
     толькі   1.04
     zonder   1.04
     沒有       1.01
    Activation Density 0.069%

    No Known Activations