    Explanations

    Words related to warnings or alerts, especially those signaling potential negative consequences.

    References to "triggers" in various contexts.

    Negative Logits
    "ensable"       -0.77
    "hemat"         -0.75
    "apest"         -0.74
    "nian"          -0.73
    "apolis"        -0.73
    " Cutler"       -0.70
    "cott"          -0.70
    "ately"         -0.69
    "ijk"           -0.66
    "esan"          -0.66

    Positive Logits
    "trigger"        0.96
    " triggering"    0.96
    " warnings"      0.93
    " triggers"      0.90
    " trigger"       0.81
    " Trigger"       0.76
    " triggered"     0.74
    "witz"           0.73
    " Warn"          0.73
    " alerts"        0.73

    Activation Density: 0.050%

    No Known Activations