Explanations
words related to the concept of "self" or identity
Negative Logits
pars      -0.18
bane      -0.17
p         -0.17
πος       -0.17
ested     -0.17
esch      -0.17
es        -0.16
esan      -0.16
esiz      -0.16
lected    -0.15
Positive Logits
OUNT      0.23
plitude   0.21
nesty     0.21
bling     0.20
بول       0.19
plit      0.19
pton      0.19
eric      0.19
ERICAN    0.19
bole      0.19
Activations Density 0.069%