    Explanations

    language related to swearing and racial slurs

    Negative Logits
    Token          Logit
    "orgh"         -0.17
    "hare"         -0.14
    "jezd"         -0.14
    "ÑģÑĤоÑĢ"       -0.13
    "yre"          -0.13
    "957"          -0.13
    "æĭ"           -0.13
    " yearly"      -0.13
    " Wilkinson"   -0.13
    "qq"           -0.13
    Positive Logits
    Token          Logit
    " prof"        0.49
    " swear"       0.46
    " curse"       0.44
    " swearing"    0.43
    " curs"        0.40
    " curses"      0.36
    " Prof"        0.36
    "prof"         0.35
    " obsc"        0.34
    " Curse"       0.34
    Activation Density: 0.164%
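
As a rough illustration of what a density figure like the one above measures, the sketch below assumes it is the fraction of token positions on which the feature fires (activation above zero). The function name, the threshold, and the toy data are hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means the share of tokens on which the feature is active.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical toy data: 10,000 token positions, 16 of them active.
toy = [0.0] * 9984 + [1.5] * 16
density = activation_density(toy)
print(f"{density:.3%}")
```

Under this reading, a density of 0.164% would mean the feature is active on roughly 1 in every 600 tokens of the evaluation text.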

    No Known Activations