    Explanations

    questions posed in the text

    Negative Logits
        "entar"       -0.16
        "errar"       -0.15
        "ented"       -0.15
        "ent"         -0.14
        "iller"       -0.14
        "ifter"       -0.14
        "ublisher"    -0.14
        "omor"        -0.14
        "anity"       -0.13
        "ãn"          -0.13
    Positive Logits
        " better"     0.31
        "better"      0.23
        " mejor"      0.21
        " else"       0.20
        " could"      0.19
        "Better"      0.18
        " more"       0.17
        " Better"     0.17
        " do"         0.17
        "could"       0.16
    Act Density 0.043%

    No Known Activations