INDEX
    Explanations

    correctness or incorrectness

    Negative Logits
    Token       Value
    '           0.61
    ){          0.57
                0.57
    '?          0.50
     are        0.49
    +           0.49
    ))          0.49
     innovate   0.48
     טוב        0.48
    }           0.47
    Positive Logits
    Token        Value
    correct      0.75
     Correct     0.68
    wrong        0.66
     Incorrect   0.61
    incorrect    0.61
    Correct      0.60
     неправи     0.59
    正确         0.59
    Incorrect    0.58
     incorrect   0.57
    Act Density 0.126%
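
    Activation density is usually read as the fraction of sampled tokens on which the feature fires at all. A minimal sketch under that assumption follows. The function name `activation_density`, the zero threshold, and the sample data are hypothetical, and the sample counts were chosen only to reproduce the 0.126% figure.

```python
# Minimal sketch: activation density as the share of tokens whose
# activation exceeds a threshold. The data below is synthetic.

def activation_density(activations, threshold=0.0):
    """Fraction of activations strictly above the threshold."""
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 126 active tokens out of 100,000.
sample = [0.0] * 99_874 + [1.5] * 126
density = activation_density(sample)
print(f"{density:.3%}")
```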

    No Known Activations