INDEX
Explanations
words related to moral judgment, often focusing on negative aspects such as immorality and irresponsibility
terms related to moral judgment and irresponsibility
Negative Logits
erville   -0.76
inder     -0.75
aina      -0.74
ppa       -0.72
achine    -0.67
opa       -0.67
estone    -0.66
adal      -0.65
adders    -0.64
glas      -0.64
Positive Logits
wasteful        0.80
undermin        0.76
folly           0.74
foolish         0.73
irresponsible   0.70
perverse        0.70
indisc          0.68
aber            0.68
hypocritical    0.66
selfish         0.66
Activations Density: 0.024%