INDEX
Explanations
male first names
proper nouns, specifically names of individuals and places
Negative Logits
lightly           -0.57
âĸ                -0.56
lockdown          -0.55
nutshell          -0.54
OPLE              -0.53
à¨                -0.53
ãĤ¼ãĤ¦ãĤ¹        -0.52
Democr            -0.52
pleas             -0.51
DragonMagazine    -0.51
Positive Logits
ona               0.83
zes               0.71
avia              0.67
ola               0.65
ys                0.63
atus              0.62
ewater            0.62
zen               0.61
azz               0.60
razen             0.60
Activations Density: 0.575%