Explanations
entities and terms related to authority, academia, and societal structure
Negative Logits
,                   -0.49
in                  -0.47
and                 -0.44
of                  -0.44
a                   -0.42
to                  -0.41
(token not shown)   -0.41
(                   -0.41
-                   -0.40
on                  -0.40
Positive Logits
_REF     0.26
私の     0.26
のか     0.25
资料     0.25
スの     0.24
震       0.24
れて     0.24
دریا     0.23
土地     0.23
イス     0.23
Activations Density 0.060%