INDEX
    Explanations

    programmed to avoid hate speech

    Negative Logits
        " snapped"     0.45
        " fairly"      0.42
        " replicated"  0.42
        " visited"     0.40
        "aint"         0.40
        " পরপর"        0.39
        "ंता"           0.39
        " skipper"     0.39
        " sn"          0.38
        " replaced"    0.38
    Positive Logits
        " Informatika"   0.49
        "})$;"           0.46
        " $*$-"          0.46
        " Islamist"      0.45
        " 書い"          0.45
        " Disse"         0.45
        " alá"           0.45
        "нга"            0.45
        "的想法"         0.45
        " respectivos"   0.44
    Activation Density: 0.001%

    No Known Activations