    NEGATIVE LOGITS
    Token             Logit
    " egreg"          -0.07
    "wyn"             -0.07
    " pubs"           -0.07
    "통신"            -0.07
                      -0.07
    " negatives"      -0.07
    "近代"            -0.07
    "赞叹"            -0.07
    "обра"            -0.06
    "考試"            -0.06
    POSITIVE LOGITS
    Token             Logit
    "Block"           0.07
    "_both"           0.07
                      0.07
    "ULA"             0.07
    " bike"           0.07
    "Support"         0.06
    " cliente"        0.06
    "Book"            0.06
    " helicopter"     0.06
    "+t"              0.06
    Activation Density: 0.004%

    No Known Activations