    Explanations

    phrases that convey admonishment or demands for behavioral change

    Negative Logits
    ".nlm"        -0.19
    ".rf"         -0.16
    "acen"        -0.15
    "ÙĨاÙĨ"       -0.15
    "Reuse"       -0.15
    "holm"        -0.15
    " dwarf"      -0.15
    "inge"        -0.14
    "arez"        -0.14
    "Occurred"    -0.14
    Positive Logits
    " stop"       0.25
    " quit"       0.24
    " tough"      0.23
    "stop"        0.23
    " STOP"       0.22
    " Stop"       0.22
    "-stop"       0.21
    " GT"         0.20
    "Stop"        0.20
    "_stop"       0.20
    Activation Density: 0.211%

    No Known Activations