INDEX
    Explanations

    code comments and structure

    Negative Logits
    ،       0.83
    ;       0.71
    :       0.68
    $,      0.64
    ',      0.63
    '       0.57
    ",      0.54
            0.52
    ING     0.52
    >       0.52
    Positive Logits
    is      0.86
    σ       0.63
    л       0.62
            0.59
    و       0.56
    行く    0.52
    е       0.52
    ى       0.52
    in      0.51
            0.49
    Act Density 0.671%

    No Known Activations