Explanations
references to specific articles or publications
Negative Logits
addy      -0.21
oor       -0.16
anax      -0.16
aktion    -0.15
rish      -0.15
anki      -0.15
aurus     -0.15
arkers    -0.15
ansa      -0.14
lder      -0.14
Positive Logits
zi        0.22
rey       0.19
iks       0.18
ignum     0.17
yer       0.16
sou       0.16
ogram     0.16
zy        0.15
ÃŃ        0.15
з         0.15
Activations Density: 0.030%