    Explanations

    mentions of safety-related issues and concerns in various contexts

    Negative Logits
    Token            Logit
    " Butcher"       -0.61
    " Wand"          -0.58
    " nude"          -0.57
    " Rand"          -0.57
    " cameo"         -0.56
    "oret"           -0.56
    " Naked"         -0.56
    " shepherd"      -0.55
    " Pound"         -0.55
    " soundtrack"    -0.55
    Positive Logits
    Token             Logit
    " levels"         0.85
    " barriers"       0.77
    "itism"           0.76
    " morale"         0.75
    "flows"           0.75
    " pathways"       0.73
    "ahime"           0.73
    " expectations"   0.72
    " nationwide"     0.71
    " among"          0.71
    Activation Density: 0.177%

    No Known Activations