INDEX
    Explanations
    Negative Logits
    "ların"  1.10
    "ki"  0.94
    "ной"  0.92
    "ov"  0.91
    "that"  0.89
    "v"  0.89
    " are"  0.86
    "ub"  0.86
    "o"  0.85
    (blank)  0.85
    Positive Logits
    "ي"  1.30
    "י"  1.18
    "ل"  0.98
    "ות"  0.82
    "ول"  0.81
    (blank)  0.79
    "يارات"  0.79
    " etiquetas"  0.78
    "使用了"  0.77
    "يلي"  0.77
    Activation Density: 0.010%

    No Known Activations