INDEX
    Explanations
    Negative Logits
    " success"     -0.28
    "æĪIJåĬŁ"       -0.27
    "åĪĽå§ĭ"       -0.27
    "addAction"    -0.26
    " invent"      -0.25
    "-winning"     -0.25
    "åĪĴ"          -0.25
    "æ®ļ"          -0.25
    " pen"         -0.25
    "pen"          -0.24
    Positive Logits
    "èĢĮä¸įæĺ¯"    0.28
    "ç®ĬæĥħåĨµ"    0.27
    "èĢĮéĿŀ"       0.26
    "orne"         0.26
    " rather"      0.25
    "rather"       0.25
    "kills"        0.25
    "dfd"          0.24
    " MILL"        0.24
    "urr"          0.24
    Activation Density: 0.041%

    No Known Activations
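
Several token strings in the logit lists above (for example `æĪIJåĬŁ` and `èĢĮä¸įæĺ¯`) look like raw byte-level BPE vocabulary entries rather than readable text. The sketch below shows one way to turn them back into text. It assumes the tokenizer uses the GPT-2-style byte-to-character table, which this page does not confirm, and the names `bytes_to_unicode` and `decode_token` are illustrative only.

```python
def bytes_to_unicode():
    """Build the GPT-2-style map from byte values to printable characters.

    Assumption: the tokenizer behind this page uses this table.
    """
    # Bytes that are already printable keep their own code point.
    keep = set(
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    mapping = {}
    shift = 0
    for b in range(256):
        if b in keep:
            mapping[b] = chr(b)
        else:
            # Remaining bytes are moved to code points 256, 257, ...
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping


# Inverse table: displayed character -> original byte value.
CHAR_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(token):
    """Convert a displayed vocabulary string back into readable text."""
    raw = bytearray()
    for ch in token:
        if ch in CHAR_TO_BYTE:
            raw.append(CHAR_TO_BYTE[ch])
        else:
            # Characters outside the table (e.g. a plain space that the
            # page already rendered) are passed through as UTF-8.
            raw.extend(ch.encode("utf-8"))
    # A token may hold only part of a multi-byte character, so invalid
    # sequences are replaced instead of raising an error.
    return raw.decode("utf-8", errors="replace")


print(decode_token("æĪIJåĬŁ"))     # 成功 ("success")
print(decode_token("èĢĮä¸įæĺ¯"))  # 而不是 ("rather than")
```

Tokens that are already plain ASCII, such as `addAction` or ` rather`, pass through unchanged. A token that decodes to replacement characters most likely covers only a fragment of a multi-byte character.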