INDEX
Explanations
words and phrases related to moral evaluations and social justice concepts
Negative Logits
اÙħØ©    -0.16
epad    -0.16
ongan   -0.16
lse     -0.15
пов     -0.14
Äįan    -0.14
etter   -0.14
inke    -0.14
ched    -0.14
elle    -0.13
Positive Logits
igua            0.17
/documentation  0.15
awns            0.15
sen             0.15
eor             0.15
.Bounds         0.14
rem             0.14
itos            0.14
unya            0.14
phil            0.14
Activations Density 0.022%