INDEX
    Explanations

    words related to negativity or disgrace

    words and phrases indicating moral judgment or condemnation

    Negative Logits
    Token         Logit
    ħĭ            -0.74
     Leilan       -0.69
    king          -0.65
    izoph         -0.63
    stood         -0.62
     Immunity     -0.60
    plane         -0.60
    wan           -0.59
    draw          -0.58
    llan          -0.58
    Positive Logits
    Token         Logit
    ations        1.21
    omin          1.21
    atory         1.05
    ational       0.99
    ious          0.96
    ifer          0.96
    ator          0.96
    itives        0.93
    omial         0.93
    atus          0.93
    Act Density 0.019%

    No Known Activations