INDEX
    Explanations

    references to agents in various contexts

    Negative Logits
    Token       Logit
    "оби"       -0.17
    "678"       -0.15
    "ara"       -0.15
    "yat"       -0.15
    "rk"        -0.15
    "erd"       -0.15
    "ستان"      -0.15
    "ble"       -0.15
    "ux"        -0.14
    "erras"     -0.14
    Positive Logits
    Token       Logit
    "nesty"     0.18
    ".Agent"    0.17
    " provoc"   0.16
    "inel"      0.15
    "urons"     0.15
    "415"       0.15
    "ooled"     0.15
    "apor"      0.15
    "bab"       0.14
    "otts"      0.14
    Act Density: 0.011%

    No Known Activations