INDEX
    Explanations

    wrong, unethical, disrespectful, problematic

    Negative Logits
     pressured  0.71
     influenz  0.71
     🙂  0.70
    ุงเทพ  0.70
     Advantage  0.67
     risky  0.66
     kesulitan  0.66
     advantage  0.66
     منفی  0.65
    보다는  0.63
    Positive Logits
     abhor  1.33
     heinous  1.33
     egregious  1.30
     abomin  1.29
     violation  1.28
     affront  1.24
     atrocious  1.21
     outrage  1.21
     disgraceful  1.19
     disgrace  1.18
    Act Density 0.380%

    No Known Activations