    Explanations

    negative adjectives and phrases related to criticism and bias

    terminology related to falsehoods and negative attributes in reports or statements

    Negative Logits
    "cade"        -0.84
    "ynthesis"    -0.83
    "lear"        -0.80
    " Waves"      -0.78
    "onds"        -0.78
    "uese"        -0.77
    "yles"        -0.77
    "eatures"     -0.77
    "ESE"         -0.76
    "lights"      -0.75
    Positive Logits
    " disrespectful"        1.43
    " unethical"            1.43
    " immoral"              1.41
    " irresponsible"        1.41
    " wasteful"             1.35
    " prejud"               1.34
    " hypocritical"         1.34
    " counterproductive"    1.33
    " sexist"               1.31
    " racist"               1.30
    Act Density 0.269%

    No Known Activations