INDEX
    Explanations

    phrases related to negative or derogatory terms

    terms associated with critique and negative characterization

    Negative Logits
    ahon        -0.79
    chwitz      -0.72
    earch       -0.71
    ensive      -0.71
    large       -0.71
    arbon       -0.70
    ij          -0.70
    range       -0.68
    angan       -0.67
    ascript     -0.67
    Positive Logits
     extraord   1.47
    gery        1.05
    esses       1.03
    hood        1.00
    ry          0.93
     archetype  0.91
    liness      0.87
     persona    0.87
     who        0.86
    doms        0.85
    Activation Density: 0.360%

    No Known Activations