    Explanations

    references to moral concepts and values

    Negative Logits
    Token               Logit
    "']\")"             -0.72
    " fingertips"       -0.68
    "]),\n"             -0.67
    (blank)             -0.67
    " Verk"             -0.66
    " المقد"            -0.65
    " viewWillAppear"   -0.64
    " *\n"              -0.63
    "()*"               -0.62
    "})->"              -0.61

    Positive Logits
    Token               Logit
    "Mor"               2.18
    "mor"               2.10
    " Mor"              2.08
    " MOR"              1.97
    " mor"              1.95
    "MOR"               1.84
    " moral"            1.80
    " Moral"            1.74
    " morales"          1.71
    "moral"             1.70
    Act Density 0.068%

    No Known Activations