INDEX
Explanations
references to notable figures or female characters
Negative Logits
lund      -0.16
zzle      -0.14
buat      -0.14
rish      -0.14
orra      -0.14
gba       -0.14
antz      -0.14
/tos      -0.14
charms    -0.14
tw        -0.14
Positive Logits
Esper     0.28
Ä         0.26
aj        0.25
ling      0.19
Ä         0.19
oj        0.18
igit      0.18
alling    0.18
mall      0.18
esper     0.17
Activations Density: 0.001%