INDEX
    Explanations

    terms related to unethical or exploitative behavior

    terms related to profanity and unethical behavior

    Negative Logits
    empty       -0.77
    warm        -0.75
    wolves      -0.75
    20439       -0.74
    forth       -0.72
    ment        -0.68
    MENTS       -0.67
    WAY         -0.67
    WAYS        -0.66
    DAY         -0.65
    Positive Logits
     prof       1.41
     mathemat   1.02
     thous      0.95
    luent       0.92
    licted      0.89
    eatures     0.88
     predec     0.87
    inances     0.85
    essor       0.84
     concess    0.83
    Activation Density 0.006%

    No Known Activations