INDEX
    Explanations

    pronouns and possessive words

    references to individuals and their roles in various scenarios

    Negative Logits
    "venge"               -0.80
    " righteous"          -0.69
    " GOODMAN"            -0.68
    " triumphant"         -0.65
    "STON"                -0.64
    "liber"               -0.64
    "irming"              -0.63
    "ovy"                 -0.63
    " congratulations"    -0.63
    "knit"                -0.63
    Positive Logits
    " lacked"             1.84
    " lacks"              1.62
    " failed"             1.54
    " cannot"             1.51
    " forgot"             1.46
    " underestimated"     1.42
    " refused"            1.39
    " incorrectly"        1.38
    "failed"              1.38
    " couldn"             1.37
    Act Density 0.730%

    No Known Activations