INDEX
    Explanations

    strange, bizarre, or disturbing things

    Negative Logits
    Token              Value
    ме                 0.34
    માં                0.34
    [token missing]    0.33
    D                  0.33
    [token missing]    0.32
    کار                0.32
    тім                0.32
    Một                0.32
    ре                 0.31
    ку                 0.31
    Positive Logits
    Token              Value
    .                  0.49
    [token missing]    0.49
    -                  0.46
    ari                0.43
    ac                 0.39
    ic                 0.38
    ia                 0.38
    et                 0.36
    ore                0.35
    ata                0.33
    Activation Density: 0.793%

    No Known Activations