INDEX
Explanations
words and phrases related to various student organizations and political themes
Negative Logits
).↵↵      -0.19
):↵       -0.17
):↵↵      -0.16
_______,  -0.16
”↵↵       -0.16
);↵↵      -0.15
"),↵      -0.15
");↵      -0.15
",↵       -0.14
.↵↵↵↵     -0.14
Positive Logits
↵         0.42
,         0.35
↵↵        0.30
and       0.29
in        0.28
(         0.26
of        0.26
&         0.26
for       0.25
at        0.25
Activations Density 0.061%