    NEGATIVE LOGITS
    Token         Logit
    "iams"        -0.09
    " deserved"   -0.09
    "iat"         -0.09
    "otent"       -0.09
    "aben"        -0.08
    " Ard"        -0.08
    "mare"        -0.08
    "-tank"       -0.08
    "yw"          -0.08
    "çĨ"          -0.08
    POSITIVE LOGITS
    Token          Logit
    " humans"      0.21
    " human"       0.18
    "human"        0.15
    "人类"         0.15
    " performed"   0.14
    " человечес"   0.13
    " Humans"      0.12
    " 인간"        0.12
    "-human"       0.12
    "Humans"       0.11
    Activation Density: 0.034%

    No Known Activations