Explanations
expressions of denial or refusal
Negative Logits
AppCompat    -0.71
गत           -0.68
‘            -0.63
Matth        -0.60
ไง           -0.59
姆斯         -0.59
artament     -0.59
Schw         -0.59
duct         -0.58
hoga         -0.58
Positive Logits
Deny         1.65
denies       1.54
deny         1.52
denial       1.42
denied       1.39
Denial       1.39
denying      1.38
deny         1.36
Denied       1.36
Deny         1.32
Activations Density: 0.035%