INDEX
    Explanations

    phrases indicating that something is morally incorrect or unacceptable

    phrases indicating the absence of wrongdoing

    Negative Logits
    incinn        -0.82
    cit           -0.81
    soType        -0.77
    xit           -0.70
    earchers      -0.69
    weeney        -0.68
    ulum          -0.66
    wit           -0.66
    cum           -0.65
     Ri           -0.64

    Positive Logits
    headed        0.75
    mouth         0.74
     wrong        0.70
     wing         0.69
    eous          0.68
     havoc        0.67
     behaviour    0.65
    nered         0.63
     flank        0.63
    doing         0.62
    Activation Density 0.011%

    No Known Activations