INDEX
    Explanations

    profane words or vulgar language

    Negative Logits
    Token       Logit
    lihood      -0.68
    NING        -0.67
    enegger     -0.67
    nces        -0.65
    Äĩ          -0.63
    risome      -0.63
    atical      -0.61
    senal       -0.61
    manship     -0.61
    POL         -0.60
    Positive Logits
    Token       Logit
    ogether     1.26
    imore       1.20
    itude       1.11
    reatment    0.96
    itudes      0.95
    uve         0.92
    zman        0.86
    itud        0.82
    ournament   0.81
    ree         0.80
    Activation Density: 0.015%

    No Known Activations