INDEX
Explanations
descriptions of actions or situations that are considered unacceptable
terms related to unacceptability and moral judgment
Negative Logits
Insight   -0.70
craft     -0.66
Fortune   -0.65
Mov       -0.65
mus       -0.63
ier       -0.63
stone     -0.60
Born      -0.60
Heal      -0.59
Speed     -0.59
Positive Logits
unacceptable    3.36
intolerable     2.29
acceptable      1.98
undesirable     1.84
unsustainable   1.72
inappropriate   1.66
objectionable   1.65
appalling       1.61
unbearable      1.61
acceptable      1.57
Activations Density: 0.021%