INDEX
    Explanations

    concepts related to arguments and reasoning

    Negative Logits
     themselves      -0.23
    ]")]↵            -0.17
    เอง              -0.16
    ']){↵            -0.16
    こう             -0.15
    iteli            -0.14
     Họ              -0.14
     THESE           -0.14
    zd               -0.14
    .nlm             -0.14
    Positive Logits
     its             1.38
     Its             1.15
    Its              1.09
    its              1.00
    其               0.74
     оно             0.63
    它               0.59
     其              0.53
     ITS             0.50
     इसक             0.48
    Act Density 0.102%

    No Known Activations