Explanations
instances of refusal or negation actions
Negative Logits
Token              Logit
setVerticalGroup   -0.89
iprot              -0.84
CreateTagHelper    -0.69
hoeddwyd           -0.66
发表于              -0.61
LookAnd            -0.61
silly              -0.59
fjspx              -0.58
giggle             -0.58
елның              -0.57
Positive Logits
Token        Logit
refuse       1.21
refused      1.20
refuses      1.18
refusal      1.15
Refuse       1.12
refusing     1.12
hesitate     0.98
refus        0.80
refuser      0.79
hesitation   0.77
Activations Density: 0.084%