    Explanations

    derogatory terms or slurs

    Negative Logits
    Token       Logit
    "ÙİØ£"      -0.16
    "ÙİØ³"      -0.14
    "ÏĮÏģ"      -0.14
    "ÑĢоз"      -0.14
    "hear"      -0.13
    "lier"      -0.13
    "pdata"     -0.13
    "ziej"      -0.13
    " Trib"     -0.13
    " fon"      -0.13
    Positive Logits
    Token       Logit
    "ardin"     0.19
    "assed"     0.15
    "kp"        0.15
    "orde"      0.14
    "anine"     0.14
    "ayscale"   0.14
    "夢"        0.14
    "ngo"       0.14
    "ż"         0.13
    " Whale"    0.13
    Activation Density: 0.033%

    No Known Activations