INDEX
Explanations
prominent names and figures in various contexts
Negative Logits
ses      -0.21
ence     -0.20
sing     -0.19
olly     -0.17
/fw      -0.16
ned      -0.16
olley    -0.16
oga      -0.15
unds     -0.15
oldem    -0.15
Positive Logits
arin     0.17
ito      0.17
pace     0.17
igans    0.16
pawn     0.16
laus     0.15
ãĥ¥      0.15
asaki    0.15
itos     0.14
iana     0.14
Activations Density 0.696%