INDEX
Explanations
repeated phrases or descriptors referring to people, institutions, or events
Negative Logits
uild      -0.15
uent      -0.15
ideon     -0.14
nze       -0.14
yer       -0.14
yb        -0.14
suming    -0.14
zens      -0.14
abis      -0.13
LError    -0.13
Positive Logits
late      0.35
late      0.30
man       0.27
incom     0.26
Late      0.26
estim     0.26
son       0.23
irre      0.22
Late      0.21
ever      0.20
Activations Density 0.223%