INDEX
    Explanations

    Sections of text that contain no activations, indicating a lack of any significant content

    Code or math-related characters

    Negative Logits
    Token                 Logit
    " […]"                -1.11
    " …"                  -0.93
                          -0.84
    "<eos>"               -0.71
    " ..."                -0.70
    "↵↵"                  -0.67
    "[…]"                 -0.65
                          -0.64
    "."                   -0.63
                          -0.62

    Positive Logits
    Token                 Logit
    " Савезне"            1.73
    " pleaſure"           1.34
    " Majefty"            1.29
    " purpoſe"            1.28
    " myſelf"             1.26
    " Мексичка"           1.26
    "Personensuche"       1.25
    "ſelves"              1.24
    "tagHelperRunner"     1.23
    "ſelf"                1.23
    Act Density 0.002%

    No Known Activations