Explanations
statements that emphasize ideological critique or moral positions
Negative Logits
:        -0.16
&&       -0.13
,        -0.13
أيضا     -0.13
latter   -0.13
×        -0.13
také     -0.12
نیز      -0.12
acman    -0.12
oe       -0.12
Positive Logits
namely   0.30
There    0.25
there    0.25
It       0.24
while    0.24
If       0.23
whereas  0.23
Whereas  0.23
While    0.23
it       0.23
Activations Density 0.156%