    Explanations

    also, followed by words

    followed by question words

    Negative Logits
    Token          Logit
    ig             0.86
    ib             0.84
    (not shown)    0.81
    ла             0.79
    ri             0.77
    پ              0.77
    ро             0.75
    ون             0.75
    ти             0.72
    ur             0.71
    Positive Logits
    Token          Logit
    " is"          0.92
    (whitespace)   0.91
    (not shown)    0.82
    " be"          0.77
    " was"         0.76
    " an"          0.75
    " it"          0.71
    " on"          0.70
    " of"          0.67
    " è"           0.67
    Activation Density: 0.373%

    No Known Activations