INDEX
    Explanations

    language indicating moral outrage and condemnation of unethical behavior

    Negative Logits
    "gii"        -0.15
    "oplayer"    -0.14
    "ubic"       -0.14
    " Erot"      -0.14
    "ære"        -0.14
    "665"        -0.14
    "verage"     -0.14
    " hass"      -0.13
    " muschi"    -0.13
    "éric"       -0.13
    Positive Logits
    " hide"      0.38
    " hor"       0.33
    " rep"       0.32
    " des"       0.31
    " sick"      0.31
    "hor"        0.29
    "hide"       0.28
    " gh"        0.28
    " repell"    0.26
    " he"        0.26
    Act Density 0.378%

    No Known Activations