INDEX
Explanations
proper nouns referring to individuals
mentions of historical figures and their affiliations
Negative Logits
Token       Logit
!!!!!       -0.63
!!!!!!!!    -0.59
"!          -0.53
!!!         -0.53
!!!!        -0.53
PTS         -0.52
ravings     -0.51
`.          -0.51
':          -0.51
"]=>        -0.51
Positive Logits
Token       Logit
*)          0.71
})          0.69
)]          0.63
)—          0.62
)}          0.62
)]          0.60
)[          0.60
?)          0.60
)\          0.60
fame        0.59
Activations Density: 1.649%