INDEX
Explanations
the concept of self-reference in language
Negative Logits
ogn      -0.16
asil     -0.15
agy      -0.15
stub     -0.15
uty      -0.14
ware     -0.14
ardin    -0.14
.sponge  -0.14
ilded    -0.13
ayo      -0.13
Positive Logits
itself    0.18
semblies  0.16
chaft     0.15
ï¸ı       0.15
-même     0.14
tones     0.14
endoza    0.14
ighton    0.14
ptal      0.14
ään       0.14
Activations Density 0.043%