Explanations
expressions of moral responsibility and social justice actions
Negative Logits
loh       -0.18
zsche     -0.15
uluk      -0.15
خارجية    -0.15
ilen      -0.15
ieri      -0.15
deaux     -0.15
uiten     -0.15
ULA       -0.14
ilik      -0.14
Positive Logits
us          0.36
ourselves   0.30
we          0.27
我们        0.25
society     0.23
ours        0.23
our         0.21
我們        0.21
everyone    0.21
nós         0.21
Activations Density 0.253%