INDEX
Explanations
references to morality, moral beliefs, and religious values
Negative Logits
biệt            -0.52
évaluateur      -0.52
RegressionTest  -0.50
ipedi           -0.50
DeleteCommand   -0.50
eira            -0.50
TREAT           -0.47
eip             -0.45
solstice        -0.45
TREAT           -0.44
Positive Logits
morals    1.07
moral     0.89
Moral     0.86
pious     0.80
moral     0.79
Moral     0.79
morality  0.77
morales   0.75
ethical   0.73
olesome   0.73
Activations Density: 0.369%