INDEX
Explanations
mentions of unethical behavior and discrimination in social contexts
Negative Logits
helicop         -0.66
cryst           -0.61
artney          -0.61
respectively    -0.59
oother          -0.58
itored          -0.58
combe           -0.58
querque         -0.57
prest           -0.57
challeng        -0.56
Positive Logits
coward          0.76
hypocrisy       0.72
modesty         0.68
liberals        0.67
dare            0.66
bigotry         0.66
blaming         0.66
feminists       0.65
offended        0.65
dishon          0.64
Activations Density: 0.523%