INDEX
    Explanations

    words associated with risks, consequences, and the importance of safety in various contexts

    Negative Logits
    selves     -0.64
    ovember    -0.60
    enegger    -0.59
    olulu      -0.57
    +.         -0.56
    poon       -0.55
    iolet      -0.54
    ornings    -0.54
     Ago       -0.54
    ECA        -0.54
    Positive Logits
    iest          0.81
     varies       0.80
     becomes      0.75
     consists     0.72
     remains      0.71
     is           0.71
     goes         0.69
     itself       0.69
     reaches      0.67
     disappears   0.67
    Act Density 0.276%

    No Known Activations