INDEX
    Explanations

    references to the speaker or first-person perspective

    Negative Logits
    Token             Logit
    " partName"       -0.70
    "vati"            -0.69
    "irlf"            -0.68
    " motivations"    -0.66
    " motiv"          -0.66
    "aturday"         -0.65
    " motivating"     -0.63
    "leys"            -0.62
    " srfAttach"      -0.61
    " motivated"      -0.61
    Positive Logits
    Token             Logit
    " paraph"         1.01
    " forget"         0.92
    " typo"           0.80
    " forgot"         0.79
    " forgetting"     0.79
    " dunno"          0.71
    " swear"          0.70
    " mean"           0.69
    "LV"              0.68
    " ours"           0.67
    Act Density 0.572%

    No Known Activations