    Explanations

    references to moral responsibility and ethical considerations

    Negative Logits
    งหมด
    -0.15
     огра
    -0.14
    ?,?,?,?,
    -0.14
    ndon
    -0.13
    ãİ
    -0.13
    اگ
    -0.13
    だって
    -0.12
     иму
    -0.12
    unta
    -0.12
    ghest
    -0.12
    Positive Logits
     both
    1.37
    both
    1.26
     Both
    1.20
     BOTH
    1.17
    Both
    1.16
     beide
    0.98
    _both
    0.98
     ambos
    0.95
     обо
    0.82
     obou
    0.81
    Act Density 1.958%

    No Known Activations