INDEX
    Explanations
    Negative Logits
    foreign
    -0.32
    Foreign
    -0.31
     Foreign
    -0.31
     Bret
    -0.29
     foreign
    -0.29
    _foreign
    -0.29
    ken
    -0.26
    殖
    -0.25
     FOREIGN
    -0.25
    Self
    -0.25
    Positive Logits
    间
    0.29
    续约
    0.28
    有更好的
    0.28
    端
    0.27
    olut
    0.27
    arnation
    0.26
     equ
    0.26
    serter
    0.26
    的房子
    0.26
    /=
    0.25
    Act Density 0.623%

    No Known Activations