INDEX
    Explanations

    proper names of individuals

    names of individuals and instances of derogatory or disparaging language

    Negative Logits
    Token                                 Logit
    "Tok"                                 -0.82
    "aido"                                -0.72
    "bos"                                 -0.71
    " replication"                        -0.70
    " Tok"                                -0.70
    "rawdownloadcloneembedreportprint"    -0.68
    "slave"                               -0.68
    "Phase"                               -0.68
    "BF"                                  -0.68
    " injection"                          -0.68
    Positive Logits
    Token                                 Logit
    " Moreno"                             2.13
    " Hayden"                             2.04
    " derogatory"                         1.58
    " dispar"                             1.32
    " depl"                               1.26
    " Flat"                               1.24
    " Chin"                               1.04
    " Jerome"                             0.97
    " Fern"                               0.94
    " Chester"                            0.94
    Activation Density: 0.042%

    No Known Activations