INDEX
    Explanations

    phrases indicating risk and implications of actions on individuals or groups

    Negative Logits
    " refuses"     -0.63
    " uttered"     -0.62
    "ulously"      -0.61
    "nce"          -0.61
    " motions"     -0.61
    "leys"         -0.60
    " forbids"     -0.60
    "naires"       -0.59
    "gans"         -0.58
    " lasts"       -0.58
    Positive Logits
    " jeopardy"        1.00
    "pmwiki"           0.84
    " peril"           0.83
    "ゴン"             0.78
    "ワン"             0.76
    "advant"           0.73
    " unwelcome"       0.72
    " uncomfortable"   0.71
    "scape"            0.70
    "ェ"               0.68
    Activation Density: 0.064%

    No Known Activations