Explanations
proper nouns
proper names, particularly those related to individuals and notable figures
Negative Logits
ilater     -0.90
ulators    -0.81
urities    -0.78
indo       -0.76
ulates     -0.73
ular       -0.67
imony      -0.67
ulatory    -0.66
ifier      -0.65
ebus       -0.65
Positive Logits
Doyle      1.32
oyle       1.19
idge       0.87
hyde       0.85
hiba       0.85
ragon      0.78
weed       0.77
mount      0.74
brush      0.74
gaard      0.73
Activations Density: 0.008%