INDEX
    Explanations

    proper nouns referring to individuals

    mentions of historical figures and their affiliations

    Negative Logits
    !!!!!       -0.63
    !!!!!!!!    -0.59
    "!          -0.53
    !!!         -0.53
    !!!!        -0.53
     PTS        -0.52
    ravings     -0.51
    `.          -0.51
    ':          -0.51
    "]=>        -0.51

    Positive Logits
    *)           0.71
    })           0.69
    )]           0.63
    )—           0.62
    )}           0.62
     )]          0.60
    )[           0.60
    ?)           0.60
    )\           0.60
     fame        0.59
    Activation Density: 1.649%

    No Known Activations