INDEX
    Explanations

    high-frequency words and pronouns typically used in discussions

    Negative Logits
    Token         Logit
    <eos>         -0.28
                  -0.28
                  -0.27
     wynosi       -0.26
    ambilan       -0.25
     również      -0.25
     Grenze       -0.24
     bluza        -0.24
    ↵↵            -0.23
     is           -0.23
    Positive Logits
    Token         Logit
    <unused41>    1.00
    <unused79>    0.99
    [@BOS@]       0.99
    <unused43>    0.99
    <unused52>    0.99
    <unused28>    0.99
    <unused68>    0.99
    <unused74>    0.99
    <unused14>    0.99
    <unused23>    0.99
    Activation Density: 0.157%

    No Known Activations