INDEX
    Explanations

    punctuation marks, particularly commas

    Negative Logits
    Token     Logit
    "abay"    -0.18
    "ace"     -0.17
    "oro"     -0.17
    "ieres"   -0.16
    "dit"     -0.15
    "orc"     -0.15
    "ACE"     -0.15
    "aka"     -0.14
    "BUR"     -0.14
    "oit"     -0.14
    Positive Logits
    Token     Logit
    "untu"    0.15
    " wr"     0.15
    "colm"    0.15
    "급"      0.14
    "643"     0.14
    "ongs"    0.14
    " 출장"   0.14
    "_tD"     0.14
    "_tF"     0.14
    "ines"    0.14
    Act Density 0.065%
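
    The density figure is commonly read as the fraction of sampled token positions on which the feature fires. The sketch below illustrates that reading only; the function name activation_density, the zero firing threshold, and the sample data are hypothetical and are not taken from this dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: a feature "fires" when its activation is strictly greater
    than the threshold (0.0 by default). The dashboard may define it differently.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical sample: the feature fires on 13 of 20,000 token positions.
sample = [0.0] * 19_987 + [1.2] * 13
density = activation_density(sample)

# Format as a percentage with three decimals, matching the dashboard's style.
print(f"Act Density {density:.3%}")
```

    Under this reading, a density this low means the feature is silent on almost every token, which is consistent with no activating examples being recorded.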

    No Known Activations