Explanations
language related to denial and refusal
Negative Logits
ึ้น                -0.58
SpringRunner     -0.57
AssemblyProduct  -0.56
artament         -0.52
chng             -0.52
Mog              -0.51
Lleva            -0.50
Kasper           -0.50
zure             -0.50
tra              -0.50

Positive Logits
refusal          1.56
reject           1.55
rejection        1.53
refuse           1.48
Refuse           1.48
rejects          1.47
denied           1.46
rejected         1.46
rejecting        1.45
Reject           1.45
Activations Density: 0.271%