    Negative Logits
     manners     -0.27
    emble        -0.26
     anywhere    -0.26
    çĿ¹          -0.25
    çŁ¥éģĵäºĨ    -0.24
    éĻIJ度       -0.24
    inct         -0.24
     Likewise    -0.24
    ä¸įå¾Ĺå·²    -0.24
    å§Ĭ妹       -0.24
    Positive Logits
    èIJ½åľ¨      0.30
    éĩįåĽŀ      0.28
    orna         0.28
    adders       0.28
    oop          0.27
    }@           0.27
    uraa         0.27
    æIJŃ          0.26
    éĩįçĶŁ      0.26
    èIJ½åΰ       0.26
    Act Density 0.005%

    No Known Activations