    Explanations

    references to attention or focus in various contexts

    Negative Logits
    rices       -0.18
    inton       -0.17
    hack        -0.17
    hma         -0.16
    itude       -0.16
    acker       -0.16
    imuth       -0.16
    pora        -0.15
    Injector    -0.15
    hood        -0.15
    Positive Logits
    ested       0.27
    estation    0.24
    uned        0.23
    ests        0.23
    esting      0.23
    itud        0.21
    en          0.21
    aining      0.21
    a           0.20
    t           0.19
    Activation Density: 0.006%

    No Known Activations