INDEX
Explanations
elements related to identity and personal history
Negative Logits
adam    -0.16
πή      -0.16
大全    -0.14
んど    -0.14
Fuse    -0.14
uco     -0.14
antas   -0.14
uest    -0.13
kå      -0.13
iaux    -0.13
Positive Logits
syn     0.28
rod     0.25
Rod     0.23
род     0.21
Rod     0.21
brat    0.20
rod     0.20
Syn     0.19
adopt   0.18
rods    0.18
Activations Density: 0.024%