    Explanations

    sentences indicating potential dangers or warnings

    Negative Logits
    Token         Logit
    ".)↵↵↵↵"      -0.14
    " ï¼į"        -0.14
    " ÂŃ"         -0.14
    "ï¿¥"         -0.14
    "wat"         -0.14
    "adil"        -0.13
    "ayload"      -0.13
    "hon"         -0.13
    "maj"         -0.13
    "_invoke"     -0.13
    Positive Logits
    Token         Logit
    " ,"          0.21
    " handjob"    0.16
    " ,↵"         0.16
    "ÑĢд"         0.15
    "illi"        0.15
    " she"        0.15
    "And"         0.15
    " And"        0.14
    " ØĮ"         0.14
    "unde"        0.14
    Activation Density: 0.074%

    No Known Activations