INDEX
Explanations
punctuation and question constructs in the text
NEGATIVE LOGITS
ÙĴÙĩ      -0.15
kaar      -0.15
ãĥ¼ãĥĨ    -0.15
avier     -0.15
iron      -0.14
ikip      -0.14
ostel     -0.14
oll       -0.14
_SLAVE    -0.14
iverse    -0.13
POSITIVE LOGITS
rita       0.18
æ´¥        0.15
556        0.15
ervo       0.15
æł¹        0.15
ahlen      0.15
ritt       0.14
ãĥ¼ãĥĸ     0.14
somewhere  0.14
nothing    0.14
Activations Density 0.010%