INDEX
    Explanations

    illegal or harmful content

    words describing prohibited content types and policy violations on online platforms.

    Negative Logits
     vinci    -1.24
     marta    -1.20
     satel    -1.16
    はこんな感じ    -1.13
     wanda    -1.13
     philippe    -1.13
    を知る    -1.12
     dorado    -1.10
     paulo    -1.09
     marmor    -1.09

    Positive Logits
     or    1.46
     versátil    1.45
     Bardzo    1.43
     içeri    1.41
     görüntüsü    1.30
     delitos    1.27
     Сергей    1.16
    либо    1.15
     content    1.15
     extremadamente    1.14

    Activation Density 0.036%

    No Known Activations