INDEX
Explanations
proper nouns, particularly names of people and characters
Negative Logits
himself       -0.24
Himself       -0.19
خودش          -0.16
sám           -0.15
itself        -0.15
kendisi       -0.14
ĵåIJį          -0.14
agli          -0.14
unga          -0.14
umer          -0.13
Positive Logits
themselves    0.30
respectively  0.29
alike         0.27
their         0.25
Their         0.24
Their         0.24
两人          0.24
각각          0.24
together      0.22
leurs         0.21
Activations Density 0.155%