INDEX
Explanations
phrases indicating refusal or resistance
instances of the word "refused" and related phrases indicating non-compliance or resistance
Negative Logits
Token                               Logit
issance                             -0.86
////////////////////////////////    -0.80
renheit                             -0.66
sav                                 -0.64
Ca                                  -0.64
calling                             -0.64
Thumbnails                          -0.63
imal                                -0.63
————————                            -0.63
Canary                              -0.63

Positive Logits
Token                               Logit
acknowledge                          1.38
bud                                  1.34
concede                              1.30
cooperate                            1.28
acknow                               1.23
accept                               1.21
participate                          1.18
comply                               1.16
obey                                 1.16
admit                                1.15

Activations Density 0.075%
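
The page does not define "Activations Density". A common reading, assumed here, is the fraction of sampled token positions on which the feature has a nonzero activation. The sketch below illustrates that reading only; the function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions where the
    feature fires at all; the source page does not spell this out.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Synthetic example: 4000 token positions, the feature fires on 3 of them.
sample = [0.0] * 4000
sample[10] = 2.5
sample[1200] = 0.7
sample[3999] = 1.1

density = activation_density(sample)
print(f"{density:.3%}")  # prints "0.075%"
```

Under this reading, a density of 0.075% means the feature is active on roughly 3 of every 4000 tokens, which fits a feature that responds only to a narrow set of refusal-related phrases.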