INDEX
    Explanations
    NEGATIVE LOGITS
    Token           Logit
    " Nelson"       -0.08
    " Marshall"     -0.07
    " Lah"          -0.07
    " plung"        -0.07
    "ORD"           -0.07
    " FUNCTIONS"    -0.07
    " sj"           -0.06
    " lbs"          -0.06
    " metabolism"   -0.06
    " attributed"   -0.06
    POSITIVE LOGITS
    Token           Logit
    " fake"         0.16
    "fake"          0.13
    " Fake"         0.13
    "Fake"          0.12
    ".fake"         0.08
    "(fake"         0.08
    " poke"         0.08
    " Fak"          0.08
    "_fake"         0.08
    " एक"           0.07
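    The logit lists rank vocabulary tokens by how strongly this feature's output direction pushes their logits down (negative) or up (positive). A common way SAE dashboards produce such lists, assumed here and not confirmed by this page, is to take the dot product of the feature's decoder vector with each token's unembedding vector and keep the k lowest and k highest scores. Real dashboards may also account for the final layer norm. The sketch below uses plain Python and toy 2-dimensional vectors. `logit_effects`, `top_and_bottom`, and every number in it are hypothetical. They are not the real weights behind this feature.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Ranked = List[Tuple[str, float]]

def logit_effects(decoder_dir: Vector, unembed: Dict[str, Vector]) -> Dict[str, float]:
    """Dot product of the feature's decoder direction with each token's unembedding vector."""
    return {
        tok: sum(d * u for d, u in zip(decoder_dir, vec))
        for tok, vec in unembed.items()
    }

def top_and_bottom(effects: Dict[str, float], k: int) -> Tuple[Ranked, Ranked]:
    """Return (k most negative tokens, k most positive tokens), strongest first."""
    ranked = sorted(effects.items(), key=lambda kv: kv[1])  # ascending by score
    negative = ranked[:k]          # most negative first
    positive = ranked[::-1][:k]    # most positive first
    return negative, positive

# Hypothetical toy weights (d_model = 2); NOT the real parameters of this feature.
decoder_dir = [1.0, 0.5]
unembed = {
    " fake": [0.10, 0.12],
    " poke": [0.06, 0.04],
    " Nelson": [-0.05, -0.06],
    " lbs": [0.01, -0.14],
}
negative, positive = top_and_bottom(logit_effects(decoder_dir, unembed), k=2)
print(negative)  # " Nelson" then " lbs"
print(positive)  # " fake" then " poke"
```

    With these toy vectors the scores are roughly 0.16 for " fake", 0.08 for " poke", -0.06 for " lbs", and -0.08 for " Nelson". Sorting once and slicing both ends avoids a second pass over the vocabulary.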
    Activation density: 0.004%
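    Activation density is the fraction of tokens on which the feature fires. At 0.004%, that is about 4 tokens in every 100,000. The sketch below assumes that "fires" means an activation strictly above zero. The threshold a given dashboard uses may differ. The function name and the toy data are hypothetical.

```python
from typing import Sequence

def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Fraction of tokens whose activation is strictly above `threshold` (assumed 0)."""
    if not activations:
        raise ValueError("activations must be non-empty")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)

# Toy data: 100,000 tokens, of which exactly 4 have a positive activation.
acts = [0.0] * 99_996 + [1.2, 0.5, 3.0, 0.7]
density = activation_density(acts)
print(f"{density:.3%}")  # 0.004%
```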

    No Known Activations