    Explanations

    percentage values in the text

    Negative Logits
    unan        -0.17
    ambi        -0.16
    acin        -0.15
    onet        -0.15
    onaut       -0.15
    stown       -0.15
    mant        -0.15
    deer        -0.15
    HO          -0.15
    spiel       -0.14
    Positive Logits
    elow        0.16
    ilio        0.16
    ventional   0.15
    Ñıж         0.15
    atus        0.14
    infeld      0.14
     Dot        0.14
    596         0.13
    ourses      0.13
     blow       0.13
    Activation Density: 0.004%

    No Known Activations