INDEX
Explanations
mentions of specific names, likely related to a particular person or topic
proper nouns, particularly names and places
Negative Logits
istically       -0.92
istic           -0.71
ually           -0.67
ities           -0.65
Reconstruction  -0.65
icals           -0.64
ãĥĩ             -0.63
senal           -0.63
occ             -0.61
istical         -0.60
Positive Logits
orthy           1.10
riter           1.04
olf             0.99
atcher          0.95
inders          0.95
ritten          0.95
erd             0.94
atson           0.94
orld            0.92
ey              0.90
Activations Density: 0.076%