INDEX
    Explanations

    statements related to reasoning and justification

    Negative Logits
    Token       Logit
    "aint"      -0.16
    "485"       -0.14
    " Trab"     -0.14
    "ży"        -0.13
    "roy"       -0.13
    "tpl"       -0.13
    "484"       -0.13
    "/cat"      -0.13
    "ola"       -0.13
    "365"       -0.13
    Positive Logits
    Token        Logit
    " why"       0.28
    " reasons"   0.27
    " Reasons"   0.23
    "why"        0.23
    " âĹĦ"       0.22
    "reason"     0.21
    " reason"    0.20
    "ìķ½"        0.19
    ".reason"    0.19
    "Reason"     0.18
    Activation Density: 0.207%

    No Known Activations