INDEX
    Explanations
    NEGATIVE LOGITS
    Token     Value
    "י"       0.69
    "ا"       0.66
    "y"       0.58
    "AH"      0.57
              0.55
    "tanto"   0.54
    "tól"     0.54
    "ARE"     0.52
    "dır"     0.52
    "ي"       0.52
    POSITIVE LOGITS
    Token     Value
    "."       0.68
    " a"      0.54
    ","       0.53
    " the"    0.49
              0.46
    " а"      0.46
    "з"       0.46
    ":"       0.44
    "it"      0.42
    " allá"   0.41
    Act Density 0.016%

    No Known Activations