Explanations
Names and references to historical figures in academia
Negative Logits
igon        -0.17
Dixon       -0.15
collateral  -0.14
(AT         -0.14
UIL         -0.14
WW          -0.14
sitting     -0.14
uv          -0.14
xon         -0.14
ite         -0.13
Positive Logits
LU    0.19
BU    0.17
kü    0.17
doch  0.16
kur   0.16
PU    0.16
CUS   0.16
BUM   0.15
łu    0.15
λμ    0.15
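The two lists are the ten most negative and ten most positive entries of a per-token logit effect. Here is a minimal sketch of how such lists could be pulled out. It assumes the effects are already available as a token-to-value mapping. On SAE dashboards these values are commonly the feature's decoder direction projected through the model's unembedding, but that is an assumption here, not something this page states. The function name `top_logits` is hypothetical.

```python
import heapq


def top_logits(effects, k=10):
    """Return (most_negative, most_positive) lists of (token, value) pairs.

    `effects` is a hypothetical dict mapping token strings to the feature's
    logit effect on that token (assumed input format).
    """
    # nsmallest/nlargest behave like a stable sort followed by slicing,
    # so ties keep their original insertion order.
    most_negative = heapq.nsmallest(k, effects.items(), key=lambda kv: kv[1])
    most_positive = heapq.nlargest(k, effects.items(), key=lambda kv: kv[1])
    return most_negative, most_positive


if __name__ == "__main__":
    # A handful of values copied from the lists above, used as toy input.
    sample = {"igon": -0.17, "Dixon": -0.15, "collateral": -0.14,
              "LU": 0.19, "BU": 0.17, "doch": 0.16}
    neg, pos = top_logits(sample, k=2)
    print(neg)  # [('igon', -0.17), ('Dixon', -0.15)]
    print(pos)  # [('LU', 0.19), ('BU', 0.17)]
```

Using `heapq` avoids sorting the full vocabulary when only the top ten are needed. For a vocabulary of tens of thousands of tokens, a single `sorted` call would also be fast enough.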
Activation Density: 0.102%
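Activation density is read here as the share of tokens on which the feature fires at all. The page does not state the threshold or the token sample used. The sketch below therefore treats any activation strictly above `threshold=0.0` as firing. The function name `activation_density` and the toy counts are hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of activations strictly above `threshold` (assumed definition)."""
    if not activations:
        raise ValueError("need at least one activation")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


if __name__ == "__main__":
    # Toy data: 102 firing tokens out of 100,000 reproduces a 0.102% density.
    acts = [0.0] * 99_898 + [1.0] * 102
    print(f"{activation_density(acts):.3%}")  # 0.102%
```

At this density the feature fires on roughly one token in a thousand.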