    Explanations

    references to heads and their positions or states

    Negative Logits
    Token       Logit
    "atz"       -0.16
    "lesi"      -0.15
    " Hop"      -0.15
    " destin"   -0.14
    "aji"       -0.14
    "ainer"     -0.14
    "ways"      -0.14
    "jet"       -0.14
    "osta"      -0.14
    "имо"       -0.14

    Positive Logits
    Token       Logit
    " wag"      0.15
    "][("       0.15
    "andon"     0.15
    "ichick"    0.14
    "?page"     0.13
    "ighbor"    0.13
    "ycz"       0.13
    "ibold"     0.13
    "@("        0.13
    " çĬ"       0.13
    Act Density 0.073%

    No Known Activations