Explanations
references to historical or mythological figures and events
Negative Logits
izu      -0.17
abei     -0.15
lah      -0.14
енз      -0.14
haf      -0.14
datum    -0.14
atif     -0.14
ált      -0.14
diag     -0.14
Domino   -0.14
Positive Logits
Pand      0.22
Vy        0.19
Hast      0.19
sage      0.18
Vide      0.17
Utt       0.17
Dw        0.16
Vir       0.16
Vir       0.16
Dra       0.16
Activation Density: 0.059%