    Explanations

    unwanted sexual or harassing behavior

    Negative Logits
     masterpiece      0.54
     plug             0.52
     killer           0.51
     trillions        0.50
     doom             0.49
     optimized        0.49
     dynamically      0.48
     civilizations    0.46
     optimization     0.46
     evils            0.46
    Positive Logits
     uncomfortable    0.82
     harassing        0.80
     harassment       0.80
     intimidation     0.78
     inappropriate    0.77
     humiliating      0.74
     escalating       0.72
     conductas        0.72
     comportamenti    0.71
     discomfort       0.70
    Act Density 0.034%

    No Known Activations