INDEX
Explanations
words related to ethics, justice, and deserving actions or outcomes
Negative Logits
ullivan    -0.68
ula        -0.62
shr        -0.60
edd        -0.59
plateau    -0.58
cross      -0.57
WI         -0.57
Sidd       -0.57
cycl       -0.56
Es         -0.56
Positive Logits
arna             0.90
applause         0.85
precedence       0.84
attention        0.81
FINE             0.81
credit           0.78
consideration    0.77
praise           0.75
dignity          0.74
scrutiny         0.73
Activations Density: 0.017%