INDEX
    NEGATIVE LOGITS
    "IO"             -0.07
    "��"             -0.06
    " comprised"     -0.06
    " dinner"        -0.06
    "oir"            -0.06
    " fu"            -0.06
    "TRANSFER"       -0.06
    " tranqu"        -0.06
    "jekt"           -0.06
    " ^("            -0.06
    POSITIVE LOGITS
    ",row"            0.06
    (token missing)   0.06
    " Anthrop"        0.06
    ".ad"             0.06
    " مسلمان"         0.06
    " bütün"          0.06
    "andard"          0.06
    "feature"         0.06
    " Of"             0.06
    " 제가"            0.06
    Act Density 0.002%

    No Known Activations