Explanations
references to groups of people or individuals
Negative Logits
Token       Logit
itself      -0.31
(es         -0.16
ayne        -0.16
its         -0.15
quine       -0.14
スマ        -0.14
ї           -0.14
irection    -0.14
ering       -0.13
انه         -0.13
Positive Logits
Token       Logit
/us         0.41
/her        0.30
self        0.29
atically    0.28
themselves  0.25
/th         0.25
/we         0.24
elves       0.24
iner        0.23
zelf        0.23
Activations Density: 0.097%
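
As a rough illustration of how a density figure like this can be read, the sketch below assumes that "activations density" means the fraction of sampled token positions on which the feature's activation is above zero. That definition, the function name `activation_density`, and the sample data are assumptions made for illustration, not details taken from this page.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: density = (number of active positions) / (total positions).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 97 active positions out of 100,000 tokens.
sample = [1.0] * 97 + [0.0] * (100_000 - 97)
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.097%
```

Under this reading, a density of 0.097% would mean the feature fires on roughly one token in a thousand.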