Explanations
binary choices or decisions in text
negative responses or indicators of disapproval
Negative Logits
anim         -0.64
transpl      -0.62
bes          -0.62
wand         -0.60
arts         -0.60
Middle       -0.58
Wand         -0.58
appet        -0.57
handcuffs    -0.57
effic        -0.57
Positive Logits
NO           4.13
NO           2.49
YES          1.75
no           1.71
YES          1.44
NI           1.30
ANY          1.20
VO           1.19
SO           1.17
DON          1.17
Activations Density: 0.017%