    Explanations

    words and phrases indicating significant changes or impacts

    `revealed itself`, `different`, `whether we`, `typical sequences`

    Negative Logits
    Token            Logit
    ` future`        -0.31
    (not shown)      -0.27
    `deres`          -0.27
    `ext`            -0.25
    (not shown)      -0.25
    ` précieux`      -0.25
    ` futurs`        -0.25
    `وند`            -0.24
    ` \`             -0.24
    ` next`          -0.24
    Positive Logits
    Token            Logit
    ` فريبيس`        0.82
    `<pad>`          0.80
    `<unused41>`     0.80
    `<unused68>`     0.80
    `<unused8>`      0.80
    `[@BOS@]`        0.80
    `<unused42>`     0.79
    `<unused43>`     0.79
    `<unused28>`     0.79
    `<unused14>`     0.79
    Activation Density: 0.134%

    No Known Activations