Explanations
names or references to prominent individuals or figures
Negative Logits
extremes    -0.15
Vern        -0.15
Epid        -0.14
chn         -0.14
DAL         -0.14
ses         -0.14
emple       -0.14
châu        -0.14
/share      -0.14
fal         -0.13
Positive Logits
peare       0.19
زاده        0.17
baz         0.16
256         0.16
apult       0.15
akespeare   0.15
ií          0.15
eldon       0.15
afs         0.15
eneg        0.15
Activations Density 0.024%