INDEX
Explanations
references to sibling relationships and familial connections
Negative Logits
himself       -0.19
idon          -0.17
ãĤĴãģĭ        -0.16
idal          -0.16
_IMPL         -0.15
izen          -0.14
INCLUDED      -0.14
zi            -0.14
Himself       -0.14
ÙĨÙ쨳Ùĩ       -0.14
Positive Logits
themselves    0.29
Their         0.18
Their         0.17
eor           0.17
team          0.17
collectively  0.16
Ñģами         0.16
their         0.16
dyn           0.15
yourselves    0.15
Activations Density 0.425%