INDEX
    Explanations

    exercising control

    Negative Logits
    Token          Logit
    " rien"        -0.06
    " indicated"   -0.06
    " بول"         -0.06
    " )↵"          -0.06
    " bore"        -0.06
    "())↵"         -0.06
    " Mayor"       -0.06
    " cant"        -0.06
    " ubuntu"      -0.06
    (blank)        -0.06
    Positive Logits
    Token          Logit
    "/drivers"     0.07
    ".Val"         0.06
    (blank)        0.06
    "first"        0.06
    " intellect"   0.06
    "(second"      0.06
    " usa"         0.06
    "agnost"       0.06
    (blank)        0.06
    ":↵↵↵↵↵↵"      0.06
    Activation Density: 0.028%

    No Known Activations