    Explanations

    terms related to inappropriate, harmful, or unethical actions

    terms related to abusive and unethical behavior

    Negative Logits
    Token               Logit
    "electric"          -0.79
    "pop"               -0.74
    "rition"            -0.68
    "ellar"             -0.67
    "soType"            -0.66
    "oret"              -0.63
    "arro"              -0.61
    "ICA"               -0.61
    "grown"             -0.60
    "soDeliveryDate"    -0.60
    Positive Logits
    Token               Logit
    " perpetrated"      1.01
    " inflicted"        0.90
    " incurred"         0.90
    " towards"          0.86
    " misconduct"       0.83
    " committed"        0.83
    " crimes"           0.83
    " dealings"         0.81
    " whatsoever"       0.81
    " harming"          0.80
    Activation Density: 0.174%

    No Known Activations