    Explanations

    common prompt beginnings

    Negative Logits
    Token (quoted)    Value
    '];'              0.88
    ' to'             0.88
    '">'              0.79
    '");'             0.77
    'لى'              0.73
    'ない'            0.72
    ' that'           0.71
    "');"             0.71
    ' batalha'        0.71
    's'               0.71
    Positive Logits
    Token (quoted)           Value
    'و'                      0.90
    'ו'                      0.88
    'il'                     0.87
    'ל'                      0.87
    (token not recovered)    0.84
    (token not recovered)    0.77
    'u'                      0.72
    'ע'                      0.72
    '其他'                   0.71
    'قبل'                    0.71
    Act Density 0.004%

    No Known Activations