    Negative Logits
    Token (quoted)          Logit
    "'"                     0.89
    "р"                     0.88
    (token not rendered)    0.86
    "-"                     0.85
    "н"                     0.85
    "й"                     0.82
    (token not rendered)    0.78
    "га"                    0.78
    "У"                     0.76
    "Ο"                     0.75
    Positive Logits
    Token (quoted)          Logit
    " or"                   1.16
    " but"                  1.04
    " are"                  1.02
    " for"                  0.98
    " is"                   0.96
    " as"                   0.94
    " of"                   0.91
    "are"                   0.86
    "ک"                     0.83
    "re"                    0.82
    Activation Density: 0.008%

    No Known Activations