INDEX
    Explanations

    punctuation marks, particularly periods and parentheses, that indicate the structure of code

    NEGATIVE LOGITS
     pleaſure      -0.73
     itſelf        -0.67
    ſelves         -0.65
     ſtate         -0.62
     Inſ           -0.61
    ſelf           -0.60
     ſtand         -0.59
     ſte           -0.58
     themſelves    -0.57
     becauſe       -0.56
    POSITIVE LOGITS
    ').            1.11
    ").            1.10
    ()).           1.03
     '').          1.00
    "].            0.97
     ').           0.96
    ]").           0.96
    __).           0.96
     ").           0.91
     "").          0.88
    Act Density 0.130%

    No Known Activations