    Explanations

    end of sentences with specific subsequent words

    Negative Logits
    Token    Value
    "t"      0.87
    "h"      0.80
    "d"      0.74
    "k"      0.71
    "f"      0.70
    "c"      0.64
    "n"      0.63
    "ia"     0.62
    "e"      0.60
    "א"      0.58
    Positive Logits
    Token    Value
    " ؟"     0.61
    "يد"     0.56
    "۔"      0.55
    " "      0.54
    (token missing)  0.51
    "ým"     0.50
    "ة"      0.50
    "؟"      0.50
    "До"     0.50
    " ؛"     0.49
    Activation Density: 0.089%

    No Known Activations