    Explanations

    unwanted or explicit requests

    Negative Logits
    Token         Value
    " "           0.96
    "!"           0.84
    " +"          0.75
    "("           0.74
    " better"     0.72
    " or"         0.72
    ":"           0.71
    "+"           0.71
    [missing]     0.70
    " ("          0.69
    Positive Logits
    Token               Value
    " purporting"       1.20
    " misog"            0.91
    " unwarranted"      0.88
    "aksud"             0.88
    " сексуа"           0.86
    " disrespectful"    0.85
    " purportedly"      0.85
    " alleging"         0.84
    " політи"           0.84
    " indiscrimin"      0.84
    Activation Density: 0.014%

    No Known Activations