INDEX
    Explanations

    negative content related to violence, discrimination, and offensive language

    references to social issues and intolerance towards various identity groups

    Negative Logits
    onen
    -0.57
     Mous
    -0.56
    noon
    -0.52
     Dangerous
    -0.52
     Nich
    -0.51
     Piper
    -0.51
    lyak
    -0.50
    20439
    -0.50
     Passage
    -0.50
     Architects
    -0.50
    Positive Logits
     etc
    1.20
    etc
    1.01
    …)
    0.84
    whatever
    0.71
     ect
    0.68
    …
    0.65
    cknow
    0.61
    cheat
    0.60
     welf
    0.58
     …
    0.58
    Act Density 0.372%

    No Known Activations