    Explanations

    phrases that indicate recognition or acknowledgment of issues

    Negative Logits
    Ñıг         -0.17
    illow       -0.15
    ãĥ¼ãĥ©      -0.15
    ÄĽÅ¾        -0.14
    ķĮ          -0.14
    ascade      -0.14
    è¾ŀ         -0.14
    ragen       -0.14
    ĵåIJį       -0.14
    λε          -0.14
    Positive Logits
    them        0.18
     otherwise  0.16
    ot          0.16
     Ahmed      0.15
     peg        0.15
    )           0.14
     Them       0.14
     Sniper     0.14
    ownt        0.14
    å®ĥ们      0.14
    Activation Density: 0.220%

    No Known Activations