INDEX
Explanations
references to names and titles
Negative Logits
apot            -0.17
apo             -0.15
insic           -0.15
ainless         -0.14
sit             -0.14
su              -0.13
ÂĬ              -0.13
sh              -0.13
heterosexual    -0.13
ing             -0.13
Positive Logits
оÑħ             0.15
iyel            0.15
IPA             0.14
rlen            0.14
λια             0.14
ylim            0.14
repeat          0.14
Plate           0.14
Baghd           0.13
oru             0.13
Activations Density 0.027%