INDEX
Explanations
concepts related to morality and ethical behavior
Negative Logits
"]);        -0.61
#+#         -0.54
%";         -0.53
jectures    -0.53
nocache     -0.51
rophes      -0.51
})          -0.51
});         -0.50
quiera      -0.50
itaire      -0.50
Positive Logits
moral            1.12
Moral            1.10
moral            1.05
morals           1.01
Moral            1.01
morality         1.00
morally          0.95
righteousness    0.92
ethical          0.89
ethics           0.86
Activations Density 0.553%