    Negative Logits
    Token    Value
    u        0.70
    p        0.53
    b        0.52
    as       0.50
    ور       0.49
    d        0.47
    (blank)  0.46
    (blank)  0.44
    c        0.43
    z        0.42
    Positive Logits
    Token    Value
    (blank)  0.70
    (blank)  0.64
    (blank)  0.61
    $.       0.54
    (blank)  0.52
    .。      0.49
    。[      0.47
    ۔        0.47
    .”       0.46
    OC       0.45
    Activation Density: 0.002%

    No Known Activations