INDEX
Explanations
adjectives describing qualities of behavior or actions
assertions about societal issues related to morality and exploitation
Negative Logits
Spec      -0.86
ITNESS    -0.82
spec      -0.76
specs     -0.73
redits    -0.70
entials   -0.68
pletion   -0.67
pleted    -0.66
uilt      -0.66
maps      -0.66
Positive Logits
insidious      1.32
tactic         1.22
hypocrisy      1.21
hypocritical   1.16
endemic        1.15
coward         1.11
despicable     1.10
rampant        1.09
tactics        1.09
pervasive      1.08
Activations Density 0.554%
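
The density figure above is presumably the fraction of sampled token positions on which this feature is active. The page does not define it, so the following is only a minimal sketch under that assumption. The function name `activation_density`, the `threshold` parameter, and the toy data are hypothetical and do not come from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means the share of positions where the feature fires,
    with "fires" taken to mean an activation strictly above `threshold`.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical toy data: 1,000 token positions, 5 of them with nonzero activation.
toy = [0.0] * 995 + [0.7, 1.2, 0.3, 2.1, 0.9]
density = activation_density(toy)
print(f"{density:.3%}")  # prints 0.500%
```

Under this reading, a density of 0.554% would correspond to the feature firing on roughly 5 or 6 of every 1,000 token positions.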