    Explanations

    following "and" descriptive words

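    Negative Logits
    Token             Value
    "ر"               0.39
    " edhe"           0.38
    "ו"               0.38
    "in"              0.38
    (not shown)       0.37
    "8"               0.36
    "י"               0.35
    "ли"              0.35
    "е"               0.34
    (not shown)       0.33

    Positive Logits
    Token             Value
    (not shown)       0.33
    (not shown)       0.33
    "ке"              0.33
    "чення"           0.32
    " at"             0.32
    " sebagainya"     0.32
    (not shown)       0.31
    " inciting"       0.30
    (not shown)       0.30
    "ಶ್ರ"              0.30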
    Activation Density: 1.749%

    No Known Activations