Explanations
Failure of AI or code
Assistant/model responses that provide structured explanations or evaluations, especially ones that note flaws or limitations or that follow task instructions.
Negative Logits
positive        0.72
enrich          0.69
enriching       0.69
stär            0.69
enrich          0.68
exhilarating    0.66
Enh             0.66
favorable       0.65
enhancing       0.65
喜爱            0.64
Positive Logits
useless         1.72
failed          1.68
ineffective     1.64
failed          1.59
incapable       1.55
futile          1.53
pointless       1.53
Failed          1.50
worthless       1.49
unsuccessful    1.48
Activations Density: 3.409%