    Explanations

    periods at the end of sentences

    Negative Logits
    Token           Logit
    " increa"       -2.35
    " disagre"      -2.29
    " affor"        -2.29
    " reluct"       -2.28
    " depic"        -2.26
    " unwarran"     -2.21
    " maneu"        -2.21
    " shenan"       -2.18
    " viciss"       -2.17
    " guarante"     -2.16

    Positive Logits
    Token           Logit
    "↵↵"            1.35
    "↵↵↵"           1.18
                    1.10
    "↵↵↵↵"          1.08
    "↵↵↵↵↵"         1.02
    "<eos>"         1.02
    " And"          0.97
                    0.96
    " But"          0.92
    "</h1>"         0.91
    Activation Density: 0.213%

    No Known Activations