    Explanations

    references to attention or attentiveness

    Negative Logits
    лиц       -0.20
    idding    -0.17
    ierge     -0.17
    zes       -0.17
    abra      -0.16
    off       -0.16
    utz       -0.15
    chte      -0.15
    offs      -0.15
    ould      -0.15
    Positive Logits
    itudes    0.29
    orney     0.27
    itude     0.26
    itud      0.24
    orneys    0.24
    ENTION    0.24
    uned      0.23
    acks      0.23
    acked     0.21
    acking    0.21
    Activation Density: 0.012%

    No Known Activations