INDEX
    Explanations
    Negative Logits
     outr         -0.30
     Decomp       -0.27
    复苏          -0.26
    分支          -0.26
    以其          -0.25
     Branch       -0.25
    Ranked        -0.25
    设计方案      -0.25
    逄            -0.24
    stay          -0.24
    Positive Logits
     oneself      0.40
    ermen         0.28
    文            0.27
    -done         0.26
    erman         0.25
    ート          0.25
    国外          0.25
    amen          0.24
    imes          0.24
     yourself     0.24
    Activation Density 0.007%

    No Known Activations