    Explanations

    language related to expressions of hate or derogatory comments directed at individuals or groups

    Negative Logits
    AppCompat           -0.57
     stories            -0.52
    文章                -0.51
     document           -0.51
     note               -0.50
     detal              -0.50
     detail             -0.50
    FileDescriptor      -0.50
     chronicles         -0.49
    وض                  -0.47

    Positive Logits
     uttered             0.92
     uttering            0.75
     utterances          0.71
    glGen                0.70
    makeConstraints      0.69
    ConstraintMaker      0.68
     utterance           0.67
    DispatchToProps      0.66
    فاده                 0.66
     GenerationType      0.66
    Act Density 0.098%

    No Known Activations