INDEX
    Explanations

    references to harmful or negative terms and concepts

    Negative Logits
    Token         Logit
    ird           -0.16
    olia          -0.15
    orrent        -0.15
    rowse         -0.15
    .Reporting    -0.14
    lined         -0.14
    /videos       -0.14
    ossible       -0.14
    song          -0.14
    ibo           -0.14
    Positive Logits
    Token         Logit
    ädchen        0.17
    ously         0.17
    uous          0.16
    ingly         0.15
    amac          0.15
    buster        0.15
    ometer        0.14
    raki          0.14
    ioctl         0.14
    ably          0.14
    Activation Density: 0.648%

    No Known Activations