INDEX
Explanations
references to moral judgments and ethical considerations
Negative Logits
transQ          -0.61
uxxxx           -0.57
isSet           -0.56
CPL             -0.55
GeoNames        -0.54
arot            -0.53
moveToFirst     -0.53
inaison         -0.53
ویکیآمباردا     -0.52
ANNEL           -0.52
Positive Logits
moral           2.89
ethical         2.67
Moral           2.52
moral           2.48
Moral           2.42
ethics          2.32
Ethical         2.29
ethical         2.26
morality        2.21
morals          2.19
Activations Density 0.105%