    Explanations

    phrases indicating size or significance

    Negative Logits
    "elman"    -0.16
    "dlg"      -0.16
    "715"      -0.15
    "aan"      -0.15
    "595"      -0.15
    "ibu"      -0.15
    "abis"     -0.15
    "ing"      -0.14
    "an"       -0.14
    "inger"    -0.14
    Positive Logits
    "ULD"        0.16
    " Jaune"     0.16
    "ouver"      0.14
    "adele"      0.14
    "bern"       0.14
    "¶"          0.13
    "uptools"    0.13
    "vit"        0.13
    " ”↵↵"       0.13
    " éĸ¢éĢ£"    0.13
    Act Density 0.017%

    No Known Activations