    Explanations

    expressions of hatred or strong negative sentiments towards individuals or groups

    Negative Logits
    Token               Logit
    "/latest"           -0.17
    "691"               -0.16
    "AdapterFactory"    -0.16
    " mates"            -0.15
    "ddit"              -0.15
    "ces"               -0.15
    "leanup"            -0.15
    "uckle"             -0.14
    "oleans"            -0.14
    "gency"             -0.14
    Positive Logits
    Token               Logit
    "irl"               0.15
    "GLE"               0.14
    "rus"               0.14
    "/env"              0.14
    "rypto"             0.14
    "AKE"               0.14
    "yne"               0.14
    "is"                0.14
    "ży"                0.14
    "mis"               0.13
    Activation Density: 0.107%

    No Known Activations