INDEX
Explanations
words associated with specific human characters or identities
Negative Logits
             -0.90
propOrder    -0.75
wikipagina   -0.73
Wikidata     -0.69
Huguen       -0.67
hudson       -0.67
Phry         -0.66
itſelf       -0.65
Houſe        -0.64
Esau         -0.63
Positive Logits
the          1.51
The          1.37
THE          1.34
The          1.33
enthe        1.13
sthe         1.09
THE          1.07
rethe        1.06
entire       0.98
the          0.97
Activations Density 0.042%