INDEX
    Explanations

    references to specific experimental details and clarifications

    Negative Logits
     στα               -0.36
     occasionally      -0.31
    ようになります      -0.27
    m                  -0.26
    .                  -0.25
     tròn              -0.24
     alami             -0.24
                       -0.24
    ↵↵                 -0.24
     temporarily       -0.24
    Positive Logits
     ſelbſt            0.88
     ſind              0.87
    <unused8>          0.86
    <unused41>         0.85
    <unused79>         0.85
    <unused14>         0.85
    <unused52>         0.85
    <unused68>         0.85
    [@BOS@]            0.85
    <unused16>         0.85
    Act Density 0.719%

    No Known Activations