INDEX
Explanations
phrases related to rule-breaking or bending social norms
negative phrases or statements
NEGATIVE LOGITS
caution     -0.67
flirt       -0.67
Dickinson   -0.67
Arabian     -0.65
hiber       -0.64
pomp        -0.62
shares      -0.61
curtain     -0.61
troop       -0.60
adm         -0.60
POSITIVE LOGITS
turned      1.29
cum         1.17
sama        1.14
selves      1.13
sized       1.12
style       1.12
related     1.11
type        1.06
induced     1.05
san         1.04
Activations Density: 0.164%