INDEX
Explanations
Key improvements and explanations
Negative Logits
Token          Logit
disadvantage   0.46
corrected      0.44
correcting     0.42
verification   0.42
fastest        0.41
dangerous      0.40
dominant       0.39
proving        0.39
budgeting      0.39
unified        0.39
Positive Logits
Token          Logit
Key            0.92
Key            0.78
explanations   0.78
Explanation    0.77
key            0.75
key            0.73
KEY            0.70
Explanation    0.68
Explain        0.66
Highlights     0.65
Activations Density: 0.021%