Explanations
proper nouns related to names or titles
Negative Logits
ト          -0.86
ctions      -0.84
ドラ        -0.80
kered       -0.77
ctive       -0.75
ivia        -0.72
レ          -0.70
ラ          -0.69
ctory       -0.69
ヤ          -0.68
Positive Logits
robe        2.01
ens         1.20
ynski       1.07
wick        0.98
ell         0.96
een         0.96
er          0.95
age         0.91
lest        0.90
chester     0.90
Activations Density 0.043%
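
Several tokens in the negative-logit list are Japanese katakana. Raw dumps from a byte-level BPE vocabulary often show such tokens as strings like ãĥĪ, because each UTF-8 byte is displayed as a stand-in printable character. The sketch below assumes the GPT-2 style byte-to-character table (an assumption about the tokenizer, which this page does not state) and inverts it to recover the readable text. The function names are illustrative, not part of any dashboard API.

```python
def bytes_to_unicode():
    """Rebuild the GPT-2 style table mapping each byte (0-255) to a printable character."""
    # Bytes that are already printable map to themselves.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    cs = bs[:]
    n = 0
    # Every remaining byte is assigned the next code point starting at 256.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}


# Inverse table: displayed character -> original byte value.
UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(token: str) -> str:
    """Turn a displayed byte-level token back into readable text."""
    raw = bytes(UNICODE_TO_BYTE[ch] for ch in token)
    # A token may hold only part of a multi-byte character, so replace bad sequences.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Raw forms written with escapes so the example survives re-encoding.
    for raw_token in ["\u00e3\u0125\u012a", "\u00e3\u0125\u012b\u00e3\u0125\u00a9", "robe"]:
        print(repr(raw_token), "->", decode_token(raw_token))
```

Plain ASCII tokens such as robe pass through unchanged, so the same helper can be applied to a whole logit table without special-casing.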