INDEX
Explanations
instances of willingness or reluctance to take specific actions or stances
Negative Logits
Surv        -0.83
Wrap        -0.73
MpServer    -0.72
Tid         -0.72
verbs       -0.71
Gleaming    -0.68
unes        -0.68
ordon       -0.65
76561       -0.63
nesium      -0.63
Positive Logits
lessness        0.90
willingness     0.86
unwillingness   0.82
adherence       0.81
attitude        0.81
jriwal          0.79
rate            0.77
fulness         0.76
compliance      0.72
refusal         0.72
Activations Density 0.064%