INDEX
Explanations
phrases indicating challenging authority or societal norms
statements about taking risks or challenging societal norms
Negative Logits
usterity      -0.78
CG            -0.67
urgy          -0.67
automatic     -0.65
UST           -0.63
ulative       -0.62
Powered       -0.62
stabilized    -0.62
ulators       -0.60
gearing       -0.60
Positive Logits
evil          1.04
defy          0.95
presume       0.83
dare          0.80
disagree      0.79
ously         0.75
argue         0.73
disob         0.73
undertake     0.73
word          0.73
Activations Density: 0.034%