    Explanations

    code, actions, and attention

    Negative Logits
    2                 0.94
     Process          0.90
    4                 0.90
    1                 0.84
     Hundreds         0.81
    PI                0.79
    8                 0.79
    9                 0.79
     Constructed      0.78
    7                 0.77
    Positive Logits
     ominous          0.91
     juegos           0.87
    (no token shown)  0.87
     agrav            0.87
     financiera       0.84
     juegan           0.84
    ны                0.82
     smoky            0.82
     viajes           0.82
     horned           0.80
    Activation Density: 0.001%

    No Known Activations