INDEX
Explanations
references to forms of oligarchy and related terminology
Negative Logits
zin           -0.17
scal          -0.16
zure          -0.16
ogi           -0.14
oji           -0.14
inou          -0.13
shops         -0.13
stellen       -0.13
stalk         -0.13
arak          -0.13
Positive Logits
exampleInput  0.19
ulet          0.17
Dude          0.16
써            0.16
reater        0.15
iah           0.15
ンド          0.15
olec          0.15
ud            0.14
utherford     0.14
Activations Density 0.002%