Explanations
references to moral values or moral concepts
Negative Logits
كومونز        -0.84
ⓧ             -0.74
.";           -0.68
_));          -0.64
MacKenzie     -0.63
spillage      -0.63
Ой            -0.61
ddelweddau    -0.60
Chavez        -0.60
proken        -0.59
Positive Logits
Moral            0.83
moral            0.76
PerformLayout    0.76
ulemon           0.75
moral            0.73
PreferredItem    0.73
оле              0.73
Mor              0.72
Morrison         0.71
Morality         0.69
Activations Density: 0.002%