    Explanations

    terms related to clear and unambiguous actions or statements

    instances of the word "outright" indicating a strong affirmation or assertion

    Negative Logits
    Token           Logit
    "ulton"         -0.78
    "agine"         -0.78
    "ĺħ"            -0.75
    "nan"           -0.72
    "arts"          -0.71
    "anners"        -0.71
    "anwhile"       -0.71
    "ramid"         -0.67
    " Neighbor"     -0.67
    "nesota"        -0.65

    Positive Logits
    Token           Logit
    "Introduced"     0.76
    "ãĤ¦ãĤ¹"         0.75
    " hostility"     0.73
    "shown"          0.71
    "eless"          0.70
    "Discuss"        0.69
    "iary"           0.68
    "itarian"        0.68
    " refusal"       0.66
    "ãĥ"             0.65

    Activation Density: 0.015%

    No Known Activations
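
Several tokens in the logit lists (for example ãĤ¦ãĤ¹ and ãĥ) look like raw byte-level BPE vocabulary strings rather than readable text. The sketch below assumes, without confirmation from this page, that the vocabulary uses the GPT-2 style byte-to-unicode mapping; under that assumption it converts such a string back to the text it encodes. The helper name `decode_token` is hypothetical and is not part of any tool shown here.

```python
def bytes_to_unicode() -> dict[int, str]:
    """Rebuild the GPT-2 style byte -> printable-character table (assumed scheme)."""
    # Bytes that are already printable map to themselves.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    # Every remaining byte is shifted to a code point starting at 256.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}


# Inverse table: vocabulary character -> original byte value.
UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(token: str) -> str:
    """Map a vocabulary string back to text (hypothetical helper)."""
    raw = bytearray()
    for ch in token:
        if ch in UNICODE_TO_BYTE:
            raw.append(UNICODE_TO_BYTE[ch])
        else:
            # Characters outside the table (e.g. an already-decoded
            # leading space) are passed through unchanged.
            raw.extend(ch.encode("utf-8"))
    # Tokens holding only part of a multi-byte character cannot be decoded
    # cleanly on their own, so invalid bytes become U+FFFD.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    for tok in ["ãĤ¦ãĤ¹", "ãĥ", "ĺħ", " hostility"]:
        print(repr(tok), "->", repr(decode_token(tok)))
```

If the assumption holds, ãĤ¦ãĤ¹ corresponds to the katakana pair ウス, while ãĥ and ĺħ are fragments of multi-byte characters and have no standalone readable form.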