INDEX
    Explanations

    unethical, harmful, disrespectful, unprofessional

    Negative Logits
     `>=  0.40
     fonts  0.40
     Fonts  0.39
     সতর্ক  0.39
     সতর্কতা  0.39
    Sweet  0.38
    苦手  0.38
     성능  0.38
     Sweet  0.38
    Performance  0.38
    Positive Logits
     disrespectful  0.57
     affront  0.52
     tantamount  0.49
     irresponsible  0.48
    是一种  0.46
     insulting  0.45
     insult  0.45
    would  0.45
     taman  0.44
     shameful  0.44
    Act Density 0.074%

    No Known Activations