    Explanations

    phrases related to ethical and moral reasoning

    Negative Logits
        Token          Logit
        "ajo"          -0.16
        "adera"        -0.15
        "dar"          -0.15
        "endi"         -0.14
        "冠"           -0.14
        "给我"         -0.14
        "ToBounds"     -0.14
        " dar"         -0.14
        "IALIZED"      -0.14
        "issen"        -0.13
    Positive Logits
        Token          Logit
        " against"     0.32
        "against"      0.26
        " Against"     0.26
        "対"           0.25
        "Against"      0.21
        " fight"       0.21
        "对"           0.21
        " против"      0.20
        " проти"       0.20
        " proti"       0.20
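On most SAE feature dashboards, the "logits" for a feature are its decoder direction projected through the model's unembedding matrix, so each value says how much the feature pushes a given output token up or down. This page does not state how its numbers were computed, so the sketch below is only a minimal illustration under that assumption. The names `decoder_dir`, `unembed`, `feature_logits`, and `top_logits` are hypothetical, and the tiny 3-dimensional vectors are invented toy values rather than real model weights.

```python
from typing import Dict, List, Tuple


def feature_logits(decoder_dir: List[float],
                   unembed: Dict[str, List[float]]) -> Dict[str, float]:
    # Assumed definition: logit(token) = decoder_dir . W_U[:, token],
    # i.e. a dot product with each token's unembedding column.
    return {tok: sum(d * u for d, u in zip(decoder_dir, col))
            for tok, col in unembed.items()}


def top_logits(logits: Dict[str, float], k: int,
               positive: bool = True) -> List[Tuple[str, float]]:
    # Descending order gives the "Positive Logits" list,
    # ascending order gives the "Negative Logits" list.
    ranked = sorted(logits.items(), key=lambda kv: kv[1], reverse=positive)
    return ranked[:k]


# Hypothetical toy weights, invented for illustration only.
decoder_dir = [1.0, 0.0, -0.5]
unembed = {
    " against": [0.3, 0.1, -0.04],
    " fight":   [0.2, 0.0, -0.02],
    " dar":     [-0.1, 0.2, 0.1],
}

logits = feature_logits(decoder_dir, unembed)
print(top_logits(logits, 2))                  # tokens the feature promotes most
print(top_logits(logits, 1, positive=False))  # token the feature suppresses most
```

A real dashboard would do the same projection with a full unembedding matrix (tens of thousands of tokens) and keep the top and bottom ten entries, as in the two lists above.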
    Activation Density: 0.396%
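Activation density is commonly reported as the percentage of sampled token positions on which the feature is active (activation above zero). A density of 0.396% means the feature fires on roughly 4 tokens in every 1,000. The page gives only the final figure, so the zero threshold, the sample size, and the helper name `activation_density` below are assumptions made for illustration.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Percentage of token positions where the activation exceeds `threshold`."""
    if not activations:
        raise ValueError("need at least one activation")
    fired = sum(1 for a in activations if a > threshold)
    return 100.0 * fired / len(activations)


# Synthetic example (not real data): 396 firing positions out of 100,000 tokens.
acts = [0.0] * 99_604 + [1.5] * 396
print(f"Activation density: {activation_density(acts):.3f}%")
# → Activation density: 0.396%
```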

    No Known Activations