INDEX
Explanations
correctness or incorrectness
Negative Logits
'           0.61
){          0.57
’           0.57
'?          0.50
are         0.49
+           0.49
))          0.49
innovate    0.48
טוב         0.48
}           0.47
Positive Logits
correct     0.75
Correct     0.68
wrong       0.66
Incorrect   0.61
incorrect   0.61
Correct     0.60
неправи     0.59
正确        0.59
Incorrect   0.58
incorrect   0.57
Activations Density 0.126%