INDEX
    Explanations

    and flag instances of toxic terms or descriptions

    references to toxic substances and their effects

    Negative Logits
    Token       Logit
    "gain"      -0.83
    "hung"      -0.76
    "FORE"      -0.74
    "AUT"       -0.73
    "bler"      -0.71
    "ploma"     -0.70
    "quart"     -0.70
    "BO"        -0.70
    "stand"     -0.70
    "Month"     -0.69
    Positive Logits
    Token            Logit
    " poisoning"     1.00
    "ologist"        0.99
    " toxic"         0.95
    "ologically"     0.95
    " substances"    0.91
    "ological"       0.90
    " fumes"         0.89
    " masculinity"   0.89
    "ologists"       0.88
    "ology"          0.86
    Activation Density: 0.008%

    No Known Activations