Explanations
proper nouns, specifically people's names
references to specific names and proper nouns
Negative Logits
inately      -0.91
è¦ļéĨĴ       -0.73
Canary       -0.67
DISTR        -0.66
prevailing   -0.65
seeker       -0.64
é¾įå¥ij士     -0.64
compr        -0.63
¥µ           -0.62
blanket      -0.61
Positive Logits
ny           1.19
Diesel       1.09
eland        0.97
ita          0.96
ned          0.94
lass         0.93
iti          0.91
ificial      0.89
omial        0.88
ners         0.87
Activations Density: 0.030%