INDEX
    Explanations

    statements that reflect basic standards of human decency and moral judgments

    Negative Logits
    Token         Logit
    "ALAR"        -0.16
    "ani"         -0.15
    " SYS"        -0.15
    " Cla"        -0.15
    "aben"        -0.15
    "ÙĨÚ¯"        -0.14
    "u"           -0.14
    " prom"       -0.14
    "t"           -0.14
    " Systems"    -0.14
    Positive Logits
    Token             Logit
    "egin"            0.16
    " Fang"           0.16
    " basic"          0.15
    "Spoiler"         0.15
    "iners"           0.15
    "onec"            0.14
    "IDENT"           0.14
    "é¡Į"             0.14
    "DisplayStyle"    0.14
    "ków"             0.14
    Activation Density: 0.201%

    No Known Activations