    Explanations

    phrases indicating understanding, justification, or explanation

    phrases indicating comprehension or reasonableness of actions

    Negative Logits
    "etry"       -0.73
    "esh"        -0.70
    "eki"        -0.69
    "pee"        -0.67
    "eng"        -0.66
    "infect"     -0.66
    "ngth"       -0.65
    "andon"      -0.63
    " resh"      -0.62
    "cler"       -0.62
    Positive Logits
    " Mellon"            0.83
    " indignation"       0.80
    "DragonMagazine"     0.75
    "FontSize"           0.74
    "NPR"                0.74
    "cffffcc"            0.72
    " understandable"    0.70
    " outrage"           0.68
    ">>\"                0.68
    "¶"                  0.67
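One common way to produce token lists like the two above is to project the feature's decoder direction onto the model's unembedding matrix and keep the k largest and k smallest scores. The page does not say how its lists were computed, so treat this as an assumption. The sketch below uses that procedure on a hypothetical two-dimensional toy vocabulary; the function names `logit_effects` and `top_logits` and all vectors are illustrative, not taken from any real model.

```python
from typing import Dict, List, Tuple

def logit_effects(decoder_dir: List[float], unembed: Dict[str, List[float]]) -> Dict[str, float]:
    # Assumed method: dot product of the feature's decoder direction
    # with each token's unembedding vector.
    return {tok: sum(d * w for d, w in zip(decoder_dir, col)) for tok, col in unembed.items()}

def top_logits(effects: Dict[str, float], k: int) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    ranked = sorted(effects.items(), key=lambda kv: kv[1])  # ascending by score
    negative = ranked[:k]         # most negative first
    positive = ranked[::-1][:k]   # most positive first
    return positive, negative

# Hypothetical toy data; leading spaces are part of the token strings.
decoder_dir = [1.0, 0.5]
unembed = {
    " understandable": [0.6, 0.2],
    " outrage": [0.5, 0.2],
    "etry": [-0.5, -0.4],
    "esh": [-0.4, -0.5],
}
positive, negative = top_logits(logit_effects(decoder_dir, unembed), k=2)
print(positive)
print(negative)
```

Sorting the full score dictionary is fine for a toy vocabulary; for a real vocabulary of tens of thousands of tokens a partial selection such as `heapq.nlargest` would avoid the full sort.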
    Activation Density: 0.048%
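An activation density of 0.048% means the feature fires on roughly 48 of every 100,000 tokens sampled. The sketch below assumes density is the percentage of tokens whose activation is strictly greater than zero; the threshold, the sample, and the function name `activation_density` are hypothetical choices for illustration.

```python
from typing import Sequence

def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Percentage of tokens whose activation exceeds `threshold` (assumed definition)."""
    if len(activations) == 0:
        raise ValueError("need at least one activation")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)

# Hypothetical sample: 48 firing tokens out of 100,000.
sample = [0.0] * 99_952 + [1.3] * 48
density = activation_density(sample)
print(f"{density:.3f}%")  # 0.048%
```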

    No Known Activations