INDEX
    Explanations
    Negative Logits
    ،                   0.95
    ي                   0.92
    (no token shown)    0.89
    د                   0.88
    д                   0.87
    i                   0.86
    am                  0.84
    (no token shown)    0.84
    ні                  0.79
    (no token shown)    0.78
    POSITIVE LOGITS
    to                  0.89
    n                   0.84
    (no token shown)    0.79
    is                  0.74
    (no token shown)    0.70
    M                   0.68
    Nueva               0.67
    tiene               0.67
    تد                  0.66
    r                   0.65
    Act Density 0.010%

    No Known Activations