    Explanations

    phrases related to actions and their moral implications

    Negative Logits
    Token          Logit
    "atter"        -0.20
    "assa"         -0.17
    " boz"         -0.15
    "çħ"           -0.15
    "erea"         -0.15
    "ål"           -0.15
    "еÑĤе"         -0.15
    ")(((("        -0.15
    "ä½³"          -0.14
    "ÑĢÑĮ"         -0.14
    Positive Logits
    Token          Logit
    " steps"        0.16
    " performed"    0.16
    "нам"           0.15
    "Performed"     0.15
    " Gron"         0.15
    "hma"           0.14
    "COPE"          0.14
    "nyder"         0.14
    " activities"   0.14
    " tasks"        0.14
    Activation Density: 0.375%
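    The activation density figure is, roughly, the percentage of tokens in an evaluation corpus on which the feature fires. A minimal sketch, assuming a token counts as active when the feature's activation is strictly above a threshold of zero; `activation_density` is a hypothetical helper, and the corpus and threshold a given dashboard uses may differ.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Hypothetical helper: percentage of tokens where the feature is active.

    Assumption: "active" means activation > threshold (default 0.0).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)
```

    For example, a feature that fires on 3 of 800 tokens has a density of `activation_density([0.0] * 797 + [1.2, 0.4, 2.0])`, which is 0.375 (percent).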

    No Known Activations