INDEX
Explanations
phrases indicating refusal or rejection
Negative Logits
rica        -0.51
View        -0.47
lehe        -0.46
maha        -0.45
CUSSION     -0.44
bitField    -0.43
lehet       -0.43
dover       -0.43
ENTY        -0.42
poj         -0.41
Positive Logits
الرياضيه          0.84
IndentedString    0.77
IVEREF            0.76
refusing          0.69
RetentionPolicy   0.68
tagext            0.68
MLLoader          0.67
ViewInit          0.65
obstin            0.65
principalColumn   0.65
Activations Density: 0.423%