Explanations
terms related to denial or refusal
Negative Logits
Token            Logit
Rejection        -0.78
Erreferentziak   -0.75
Rejection        -0.74
rejection        -0.70
Reject           -0.70
Rejected         -0.68
تقاوى            -0.67
قایناقلار        -0.67
throwaway        -0.67
rejet            -0.66
Positive Logits
Token            Logit
math             1.01
deny             0.98
denied           0.86
denying          0.85
denies           0.77
denial           0.68
Math             0.66
math             0.63
WebServlet       0.61
Math             0.59
Activation Density: 0.119%