Explanations
expressions of denial or refusal
Negative Logits
Token              Logit
texttt             -0.71
таж                -0.66
Hig                -0.64
usk                -0.63
tagHelperRunner    -0.62
毅                 -0.61
AppCompat          -0.61
TableField         -0.60
TagHelpers         -0.60
Natasha            -0.59
Positive Logits
Token              Logit
denies             1.71
denied             1.71
deny               1.65
denial             1.65
denying            1.56
denied             1.51
Deny               1.49
denial             1.48
Denied             1.48
Denial             1.47
Activations Density: 0.128%