INDEX
    Explanations
    Negative Logits
     Hannah         -0.07
    (comment        -0.07
     impressive     -0.07
     "),            -0.07
    antaged         -0.07
     Kath           -0.07
     Listener       -0.06
     Tuesday        -0.06
    -control        -0.06
     varying        -0.06
    Positive Logits
    ilo             0.22
    elo             0.14
    alo             0.14
     Milo           0.13
    o               0.10
     Angelo         0.09
    ilos            0.09
    ilon            0.08
    ило             0.08
     Halo           0.07
    Act Density 0.008%

    No Known Activations