    Explanations

    phrases related to insults

    references to insults and derogatory language

    Negative Logits
    Token           Logit
    "negie"         -0.73
    "ills"          -0.73
    "arten"         -0.70
    "enfranch"      -0.67
    "20439"         -0.65
    "frames"        -0.64
    "iggle"         -0.64
    "ulhu"          -0.63
    "olin"          -0.63
    "illon"         -0.62

    Positive Logits
    Token           Logit
    " insult"        1.02
    " insults"       0.96
    " insulted"      0.95
    " disrespect"    0.93
    " insulting"     0.90
    " humour"        0.83
    " caric"         0.82
    "ingly"          0.81
    " humili"        0.80
    " hygiene"       0.76

    Activation Density: 0.092%

    No Known Activations