INDEX
    Explanations
    Negative Logits
     Christopher    -0.08
     valign         -0.08
    zwa             -0.07
     chrono         -0.07
     확보            -0.07
    �ราะห์          -0.07
     Lagu           -0.07
     busc           -0.07
    ("//*[@         -0.07
    ROR             -0.07
    Positive Logits
     profanity      0.17
     misconduct     0.14
     offend         0.14
     obscene        0.13
     offending      0.13
     derog          0.12
     vulgar         0.12
     extremist      0.12
     hateful        0.12
     taboo          0.12
    Act Density 0.012%

    No Known Activations