    Explanations

    sentences that list reasons or explanations

    Negative Logits
    Token        Logit
    "kl"         -0.17
    "aight"      -0.16
    "\grid"      -0.15
    "조"         -0.14
    "wort"       -0.14
    "licable"    -0.14
    "kla"        -0.14
    "éĭ"         -0.13
    "alus"       -0.13
    "_spin"      -0.13
    Positive Logits
    Token         Logit
    " Firstly"    0.30
    " firstly"    0.26
    "まず"        0.24
    " first"      0.23
    "①"           0.21
    " First"      0.20
    " primero"    0.20
    "First"       0.19
    "начала"      0.19
    " 먼저"       0.19
    Act Density 0.192%

    No Known Activations