Explanations
references to systemic injustice and commentary on societal inequalities
Negative Logits
Token          Logit
ertas          -0.16
ubar           -0.15
ÏģÏī           -0.14
(strtolower    -0.14
idth           -0.14
bjerg          -0.13
oldt           -0.13
iesz           -0.13
ãĤµãĤ¤         -0.13
миÑĤ           -0.13
Positive Logits
Token          Logit
simply         0.28
while          0.27
without        0.26
merely         0.24
solely         0.24
even           0.22
despite        0.22
mere           0.21
using          0.21
under          0.21
Activations Density 0.763%
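
Several tokens in the negative-logit list (ÏģÏī, ãĤµãĤ¤, миÑĤ) look like the printable stand-ins that byte-level BPE tokenizers use for raw UTF-8 bytes. The page does not say which tokenizer produced them, so the sketch below is only an assumption: it rebuilds the GPT-2-style byte-to-character table and maps each stand-in character back to its byte. The names `bytes_to_unicode` and `decode_token` are illustrative helpers, not part of the dashboard.

```python
def bytes_to_unicode():
    """Build a GPT-2-style byte -> printable-character table.

    Assumption: the tokens above were rendered with this convention.
    """
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    codepoints = printable[:]
    n = 0
    for b in range(256):
        if b not in printable:
            # Non-printable bytes are shifted to code points 256, 257, ...
            printable.append(b)
            codepoints.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(printable, codepoints)}


# Inverse table: display character -> original byte value.
UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(display):
    """Turn a displayed token back into readable text.

    Characters outside the table (for example already-decoded Cyrillic
    letters) are passed through as their own UTF-8 bytes.
    """
    raw = b"".join(
        bytes([UNICODE_TO_BYTE[ch]]) if ch in UNICODE_TO_BYTE else ch.encode("utf-8")
        for ch in display
    )
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    for tok in ["ÏģÏī", "ãĤµãĤ¤", "миÑĤ", "ertas"]:
        print(repr(tok), "->", repr(decode_token(tok)))
```

Under this assumption the three odd-looking entries decode to short Greek, Japanese, and Cyrillic fragments, which fits the rest of the negative list: word pieces unrelated to the feature's English connective vocabulary.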