INDEX
Explanations
specific words related to societal norms
references to societal norms and expectations
Negative Logits
Token     Logit
head      -0.73
wen       -0.72
RET       -0.66
iddler    -0.64
lees      -0.64
Hidden    -0.64
Died      -0.63
del       -0.63
zz        -0.63
lee       -0.63
Positive Logits
Token          Logit
norms          3.65
norm           1.97
normative      1.66
conventions    1.60
norm           1.59
stereotypes    1.41
standards      1.38
ideals         1.38
Norm           1.37
expectations   1.36
Activation Density: 0.017%
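
For readers unfamiliar with the density figure, here is a minimal sketch of how such a number could be computed. It assumes that "density" means the percentage of token positions at which the feature's activation is above zero; the page itself does not state its definition. The function name activation_density, the threshold parameter, and the toy data are hypothetical and exist only for illustration.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the percentage of positions whose activation exceeds `threshold`.

    Assumption: density = (number of active positions) / (total positions).
    """
    if len(activations) == 0:
        raise ValueError("activations must be non-empty")
    # Count positions where the feature fires above the threshold.
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Toy data (hypothetical): 10,000 token positions, 2 of which are active.
toy = [0.0] * 9998 + [0.8, 2.3]
print(f"{activation_density(toy):.3f}%")
```

Under this reading, a density of 0.017% would correspond to the feature being active on roughly 17 out of every 100,000 token positions.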