INDEX
    Explanations

    words related to controversial or negative information, particularly regarding political or societal issues

    Negative Logits
    Token        Logit
    "arts"       -0.80
    "ulton"      -0.79
    " wisely"    -0.73
    " Adviser"   -0.72
    "ĺħ"         -0.72
    "ĸļ"         -0.71
    "agine"      -0.69
    "gerald"     -0.68
    "aido"       -0.68
    "nan"        -0.68
    Positive Logits
    Token              Logit
    " hostility"       0.86
    " disregard"       0.81
    " racism"          0.78
    " malice"          0.75
    " refusal"         0.74
    " contradiction"   0.73
    " denial"          0.72
    " rejection"       0.72
    " ban"             0.71
    " sexism"          0.71
    Activation Density: 0.022%

    No Known Activations