INDEX
Explanations
names and titles of individuals
names of individuals or proper nouns
Negative Logits
Token      Logit
FANT       -0.78
BEFORE     -0.77
Firstly    -0.66
ONE        -0.66
HERE       -0.65
CAN        -0.65
NEVER      -0.65
Tonight    -0.63
OVER       -0.62
TYPE       -0.62
Positive Logits
Token        Logit
likewise     0.83
maxwell      0.74
bernatorial  0.72
langu        0.71
aeus         0.68
another      0.67
meanwhile    0.67
similarly    0.66
hemoth       0.65
yang         0.63
Activations Density: 0.432%