    Explanations

    words related to negative behavior or actions, specifically focusing on harassment

    instances of the word "harassment" in various contexts

    NEGATIVE LOGITS
    Token          Logit
    "rians"        -0.76
    "ACTED"        -0.76
    "archs"        -0.71
    "essential"    -0.71
    "éĹĺ"          -0.71
    "obb"          -0.71
    "arch"         -0.70
    "rich"         -0.69
    "ramid"        -0.68
    "stanbul"      -0.67
    POSITIVE LOGITS
    Token            Logit
    " harass"        1.07
    " harassment"    1.02
    " harassing"     0.92
    " harassed"      0.91
    " stalking"      0.84
    " accus"         0.78
    "assment"        0.78
    " tactics"       0.73
    "lords"          0.73
    " complaints"    0.72
    Activation Density 0.017%

    No Known Activations