INDEX
    Explanations

    preventing or stopping actions

    Negative Logits
    ка  0.83
    де  0.78
    при  0.75
    einander  0.74
    的不同  0.70
    گی  0.69
     different  0.69
    ωσ  0.69
     Different  0.69
    מים  0.68
    Positive Logits
     altogether  0.99
     catég  0.91
     wszel  0.84
     impide  0.84
    tive  0.83
    ishment  0.80
    0.79
    ുവെ  0.77
    wirkung  0.77
    갑습니다  0.76
    Act Density 0.274%

    No Known Activations