    Explanations

    bad followed by descriptions

    Negative Logits
    "ال"                1.58
    "માં"               1.48
    (no token shown)    1.47
    "ın"                1.45
    (no token shown)    1.45
    "Ма"                1.34
    (no token shown)    1.27
    "ر"                 1.24
    "0"                 1.23
    "‌ها"               1.22
    Positive Logits
    " on"               1.36
    "c"                 1.14
    "bad"               1.02
    "y"                 1.02
    "h"                 1.00
    (no token shown)    0.95
    "t"                 0.95
    "ontal"             0.91
    "𝟮"                 0.90
    " bad"              0.89
    Act Density 0.022%

    No Known Activations