INDEX
    Explanations

    rape and harassment

    Negative Logits
    Token              Logit
    " violating"       -0.78
    " harassing"       -0.76
    " Viol"            -0.73
    " baptism"         -0.73
    " violated"        -0.71
    " publiques"       -0.70
    "SharedDtor"       -0.69
    " démocr"          -0.68
    " intimidate"      -0.66
    " intimidating"    -0.66
    Positive Logits
    Token              Logit
    " control"         0.52
    " agents"          0.52
    "Autoritní"        0.51
    "NUMX"             0.51
    " controlled"      0.49
    " agent"           0.47
    " open"            0.47
    "styles"           0.47
    " minute"          0.46
    "ERICA"            0.46
    Activation Density: 0.044%

    No Known Activations