    Explanations

    phrases related to potential dangers, risks, and catastrophic events

    Negative Logits
        " thut"     -1.93
        " aen"      -1.93
        " effe"     -1.93
        " nece"     -1.85
        " fte"      -1.84
        " fta"      -1.82
        " „,"       -1.81
        " ?..."     -1.76
        " fep"      -1.76
        " meis"     -1.76
    Positive Logits
        " if"        0.80
        " due"       0.72
        " or"        0.67
        "."          0.66
        " unless"    0.66
        "if"         0.65
        " because"   0.63
        "<bos>"      0.63
        "roasted"    0.63
        "due"        0.62
    Activation Density: 0.508%

    No Known Activations