Explanations
expressions of surprise or admiration
Negative Logits
illet       -0.15
AAAAAAAA    -0.15
loor        -0.14
ทย          -0.14
сь          -0.14
žen         -0.14
idelberg    -0.13
tega        -0.13
dej         -0.13
emean       -0.13
Positive Logits
zers        0.29
zer         0.26
za          0.18
zas         0.17
talk        0.16
Lever       0.15
www         0.15
indr        0.15
outh        0.15
ös          0.15
Activations Density: 0.046%