INDEX
Explanations
themes related to social structures and cultural elements
Negative Logits
Both     -0.20
Both     -0.19
beide    -0.19
обо      -0.17
BOTH     -0.16
两人     -0.16
537      -0.16
_both    -0.16
両       -0.15
二人     -0.14
Positive Logits
etc          0.32
all          0.31
etc          0.28
等           0.25
—all         0.22
등을         0.21
hepsi        0.21
altogether   0.20
tc           0.20
등의         0.20
Activations Density 0.504%