    Negative Logits
    Token            Logit
    "imme"           1.83
    "ির"             1.68
    " bicara"        1.66
    (blank)          1.66
    " sebagainya"    1.64
    " किन्तु"         1.63
    "мл"             1.62
    "&$\"            1.58
    "toBe"           1.57
    "bones"          1.55
    Positive Logits
    Token              Logit
    "ח"                2.23
    "ج"                2.16
    "غ"                1.87
    "ط"                1.82
    " In"              1.78
    " Implementing"    1.78
    " Reasoning"       1.73
    "ح"                1.72
    " זאת"             1.70
    " Afterward"       1.68
    Activation Density: 0.145%

    No Known Activations