INDEX
    Explanations

    connections and interactions between actions and their consequences

    Negative Logits
    "actions"  -0.15
    "egot"     -0.15
    "ae"       -0.14
    "alk"      -0.14
    " bolt"    -0.14
    "ahn"      -0.14
    " self"    -0.14
    "self"     -0.14
    " Past"    -0.14
    "running"  -0.14
    Positive Logits
    "리ì¦Ī"   0.14
    "iloc"     0.14
    "IFT"      0.14
    "936"      0.14
    "èĨ"       0.14
    "olina"    0.14
    "èįIJ"     0.13
    "ERIC"     0.13
    "arena"    0.13
    "лоб"      0.13
    Act Density 0.196%

    No Known Activations