INDEX
    Explanations

    references to physical actions and their consequences

    Negative Logits
    ollen    -0.15
    allo     -0.14
     ADV     -0.13
     syn     -0.13
     Sug     -0.13
    orb      -0.13
     Modern  -0.13
    заб      -0.13
    fal      -0.13
    illa     -0.13
    Positive Logits
    _CID     0.16
    Ñĩе      0.15
    éĩ       0.15
    usch     0.15
    üf       0.14
    iy       0.14
    ÙIJÙĩ     0.13
    kop      0.13
    uggy     0.13
    czy      0.13
    Activation Density: 0.054%

    No Known Activations