    Negative Logits
    Token     Value
    "ו"       0.82
    "ل"       0.81
    "י"       0.79
    "ن"       0.75
    "و"       0.72
              0.71
              0.68
    " sh"     0.68
    "з"       0.66
    "ي"       0.66
    Positive Logits
    Token       Value
    " Circus"   0.83
    " circus"   0.79
    "🎪"        0.66
    " acrob"    0.62
    " Clown"    0.60
    "起来"      0.59
    "𝐈"         0.58
    "Circ"      0.55
    "ING"       0.54
    "upati"     0.54
    Activation Density: 0.003%

    No Known Activations