INDEX
    Explanations

    names of individuals

    references to specific individuals and moral implications

    Negative Logits
    Token         Logit
    "nets"        -0.86
    "sonian"      -0.86
    "pack"        -0.76
    "liners"      -0.75
    "jri"         -0.74
    "unders"      -0.73
    "gers"        -0.73
    "acement"     -0.70
    "lets"        -0.69
    "enegger"     -0.69
    Positive Logits
    Token         Logit
    "terday"      0.75
    " surv"       0.73
    "hyde"        0.72
    "HAEL"        0.71
    "utical"      0.67
    "ajor"        0.67
    "ouched"      0.66
    "VICE"        0.64
    "Ba"          0.64
    " wrestle"    0.62
    Activation Density: 0.028%

    No Known Activations