Explanations
negative adjectives related to immoral or unethical behavior
words that describe morally reprehensible actions or qualities
Negative Logits
essor        -0.75
iets         -0.71
stabilize    -0.71
ersion       -0.71
aver         -0.71
knit         -0.69
eding        -0.66
chi          -0.66
ovember      -0.66
hner         -0.66
Positive Logits
despicable    0.88
hypocrisy     0.82
injustice     0.81
vile          0.80
blasp         0.79
slander       0.78
deeds         0.77
disgusting    0.75
ifiable       0.75
ly            0.75
Activations Density 0.099%
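
The density figure above is not defined on this page. A minimal sketch, assuming it means the fraction of sampled tokens on which the feature's activation is above zero, is shown below. The function name `activation_density`, the `threshold` parameter, and the toy data are hypothetical and do not come from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens on which the feature
    fires at all. The dashboard itself does not state its definition.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical toy data: the feature fires on 1 token out of 1000.
toy_activations = [0.0] * 999 + [2.5]
density = activation_density(toy_activations)
print(f"{density:.3%}")  # prints 0.100%
```

Under this reading, a density of 0.099% would mean the feature is active on roughly one token in a thousand, which fits a feature that responds only to a narrow family of words.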