    Explanations

    references to seriousness and safety concerns

    Negative Logits
    Token           Logit
    "enheim"        -0.16
    " ÃŃch"         -0.16
    "abby"          -0.15
    "#"             -0.14
    "iteli"         -0.14
    " ìĭľíĸī"       -0.13
    "emey"          -0.13
    "indsight"      -0.13
    "reib"          -0.13
    "quine"         -0.13

    Positive Logits
    Token           Logit
    " serious"       1.09
    " Serious"       0.93
    "serious"        0.91
    " seriousness"   0.88
    " seriously"     0.76
    "-ser"           0.69
    " seri"          0.67
    " Seriously"     0.60
    " ÑģеÑĢÑĮез"     0.59
    "Ser"            0.58
    Act Density 0.251%

    No Known Activations