INDEX
    Explanations
    NEGATIVE LOGITS
    Token         Logit
    bad           -0.07
     slav         -0.07
    Lambda        -0.07
     Pure         -0.06
     Scientists   -0.06
    ysics         -0.06
    aised         -0.06
     stereotype   -0.06
     prepare      -0.06
    ?>><?         -0.06
    POSITIVE LOGITS
    Token         Logit
    ductory       0.07
     lesbische    0.07
    งส            0.06
    -going        0.06
    ,’’           0.06
     جزء          0.06
    fecha         0.06
     نگهداری      0.06
                  0.06
    .teacher      0.06
    Activation Density: 0.085%

    No Known Activations