INDEX
    Explanations

    expressions related to criticism and social accountability

    Negative Logits
    Token     Logit
    "dv"      -0.16
    "orage"   -0.15
    "pron"    -0.15
    "Blank"   -0.14
    "bolt"    -0.14
    "メラ"    -0.14
    "ーテ"    -0.14
    " Blank"  -0.14
    "拍"      -0.14
    "ếp"      -0.14
    Positive Logits
    Token     Logit
    " stop"   0.25
    " Stop"   0.24
    "_stop"   0.24
    "Stop"    0.23
    "stop"    0.22
    " quit"   0.22
    "-stop"   0.22
    " STOP"   0.21
    "_STOP"   0.21
    "STOP"    0.21
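
Logit lists like the ones above are commonly produced by scoring every vocabulary token against a feature's output direction and keeping the extremes. The sketch below assumes that reading; the source does not state how its values were computed. The function name `top_logit_tokens`, the toy vocabulary, and all numbers are hypothetical and chosen only for illustration.

```python
def top_logit_tokens(direction, token_vectors, vocab, k=3):
    """Return (negative, positive) lists of (token, logit) pairs.

    Assumption: a token's logit is the dot product of the feature's
    output direction with that token's unembedding vector.
    """
    scores = [
        sum(d * w for d, w in zip(direction, vector))
        for vector in token_vectors
    ]
    # Sort ascending so the most negative tokens come first.
    ranked = sorted(zip(vocab, scores), key=lambda pair: pair[1])
    negative = ranked[:k]
    positive = ranked[::-1][:k]  # most positive first
    return negative, positive


# Toy data (made up): one unembedding vector per vocabulary token.
vocab = [" stop", " quit", "dv", "bolt"]
token_vectors = [
    [0.5, 0.0],
    [0.4, 0.1],
    [-0.3, 0.2],
    [-0.2, -0.1],
]
direction = [0.5, 0.0]

negative, positive = top_logit_tokens(direction, token_vectors, vocab, k=2)
for token, logit in positive:
    print(repr(token), round(logit, 2))
for token, logit in negative:
    print(repr(token), round(logit, 2))
```

Printing tokens with `repr` keeps a leading space visible, which matters here because entries such as " stop" and "stop" are distinct tokens.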
    Act Density 0.263%

    No Known Activations