    Negative Logits
    >       0.78
    ت       0.76
    AND     0.74
    t       0.73
    ب       0.70
    ح       0.70
    ür      0.68
    ع       0.68
    st      0.65
    bos     0.63
    Positive Logits
    ння     0.81
    ى       0.81
    ли      0.80
            0.79
            0.79
            0.76
    8       0.73
     They   0.71
    жа      0.71
    ния     0.69
    Act Density 0.080%

    No Known Activations