INDEX
Explanations
prominent figures in various fields
titles and roles of individuals
Negative Logits
justifies     -0.74
exceeds       -0.73
threatens     -0.70
chery         -0.70
destroys      -0.70
excludes      -0.69
styles        -0.69
eliminates    -0.66
overwhel      -0.66
rushes        -0.66
Positive Logits
holiest       0.84
acronym       0.80
largest       0.76
oldest        0.75
proud         0.73
igmatic       0.71
Mub           0.69
bedrock       0.69
quartered     0.68
latest        0.68
Activations Density 0.265%
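
The density figure above can be read as the share of token positions on which the feature fires at all. The sketch below illustrates that reading only; it assumes density means "fraction of positions with activation above a threshold", which the page itself does not state. The function name `activation_density`, the `threshold` parameter, and the sample values are hypothetical and not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds threshold.

    Assumption: "density" is the share of token positions on which the
    feature is active (activation > threshold). This is an illustrative
    definition, not one confirmed by the source page.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical activations for eight token positions (made-up numbers).
sample = [0.0, 0.0, 1.3, 0.0, 0.0, 0.0, 0.4, 0.0]
print(f"{activation_density(sample):.3%}")
```

Under this reading, a density of 0.265% would mean the feature is active on roughly 1 in every 377 token positions.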