INDEX
    Explanations

    words in a specific non-English language

    Negative Logits
    amma       -0.14
    roj        -0.14
    leh        -0.13
    بت         -0.13
     lap       -0.13
     dro       -0.13
    ate        -0.13
    al         -0.13
    .prop      -0.13
    mer        -0.13
    Positive Logits
    ppard      0.17
    hiba       0.16
    isclosed   0.16
    ofday      0.15
    PÅĻed      0.15
     follando  0.15
     slog      0.15
    esktop     0.15
     Sloan     0.14
     Vance     0.14
    Activation Density: 0.111%

    No Known Activations