INDEX
    Explanations

    phrases related to moral and ethical judgments

    Negative Logits
    lero
    -0.07
    rip
    -0.07
    Ỽt
    -0.07
    _preds
    -0.06
    šti
    -0.06
    наче
    -0.06
    гл
    -0.06
    instead
    -0.06
     ضمن
    -0.06
     pena
    -0.06
    Positive Logits
     physical
    0.15
    physical
    0.14
     Physical
    0.13
     overt
    0.13
     direct
    0.13
     directly
    0.12
    Physical
    0.12
    direct
    0.11
     obvious
    0.11
     physically
    0.11
    Act Density 0.058%

    No Known Activations