    Explanations

    phrases related to controversial or offensive language or actions

    Negative Logits
    Token               Logit
    "frames"            -0.76
    "hower"             -0.74
    "negie"             -0.73
    "runner"            -0.67
    "olin"              -0.67
    " Stability"        -0.66
    "illon"             -0.65
    " stabilization"    -0.64
    "aea"               -0.63
    "itness"            -0.63
    Positive Logits
    Token               Logit
    " slurs"            1.38
    " insults"          1.10
    " slur"             1.06
    " insulted"         1.05
    " insulting"        1.05
    " jokes"            1.03
    " remarks"          1.03
    " insult"           1.02
    " caricature"       1.02
    " homophobic"       1.01
    Activation Density: 0.304%

    No Known Activations