INDEX
    Explanations

    offensive language and derogatory terms

    Negative Logits
    kinson     -0.17
    ิร         -0.15
    zik        -0.14
    ners       -0.14
    urve       -0.14
    ecided     -0.14
    zel        -0.14
    /plugin    -0.14
    mv         -0.13
    ÑĥÑĩ       -0.13
    Positive Logits
    YLE        0.15
    ê¶Į        0.14
    edd        0.14
    oten       0.14
    umb        0.14
    emouth     0.13
    aroo       0.13
     franca    0.13
    elen       0.13
    _OPTS      0.13
    Act Density 0.028%

    No Known Activations