INDEX
    Explanations
    Negative Logits
    " Gaga"       -0.07
    "ire"         -0.06
    " hải"        -0.06
    "sea"         -0.06
    " Truly"      -0.06
    "AsString"    -0.06
    "oving"       -0.06
    ":',"         -0.06
    "ovo"         -0.06
    " Command"    -0.06
    Positive Logits
    "utherford"       0.12
    " Modi"           0.10
    " Santo"          0.08
    " Newton"         0.07
    " composers"      0.07
    "chers"           0.07
    " Newtown"        0.07
    " props"          0.07
    " problematic"    0.06
    "UFACT"           0.06
    Activation Density: 0.007%

    No Known Activations