INDEX
    Explanations

    words and phrases related to unethical or abusive practices

    Negative Logits
    Token               Logit
    "cket"              -0.16
    ".scalablytyped"    -0.15
    "lamaz"             -0.15
    "owie"              -0.15
    "victim"            -0.14
    "coat"              -0.14
    "umar"              -0.14
    "eyh"               -0.14
    "(æĹ¥"              -0.14
    "/="                -0.14
    Positive Logits
    Token               Logit
    " Practices"        0.30
    "ness"              0.29
    " practices"        0.29
    " behaviour"        0.27
    " behavior"         0.27
    "/question"         0.25
    "ities"             0.25
    "/problem"          0.23
    " Behavior"         0.22
    "/il"               0.22
    Activation Density: 0.122%

    No Known Activations