INDEX
    Explanations

    expletives or strong vulgar language

    Negative Logits
    Ĭ±          -0.73
     ancest     -0.71
     mosqu      -0.70
     pestic     -0.67
    utterstock  -0.67
    æĪ¦         -0.66
     rece       -0.66
     reluct     -0.64
    Decre       -0.64
     reper      -0.63
    Positive Logits
    hole    1.18
    holes   1.15
    tty     1.02
    king    0.98
    cking   0.94
    kers    0.92
    shit    0.88
    gger    0.88
    fuck    0.85
    ing     0.85
    Act Density 0.009%

    No Known Activations