    Negative Logits
     verilen      -0.08
     fazê         -0.08
     carefully    -0.08
     Carefully    -0.08
     Viol         -0.08
     dava         -0.07
     विजय         -0.07
     grievance    -0.07
     lide         -0.07
    ikhiqizo      -0.07
    Positive Logits
    .blank        0.09
    ми            0.09
                  0.08
     поб          0.08
    .lower        0.08
    enness        0.08
     أف           0.08
     puppy        0.08
    mi            0.07
    .train        0.07
    Activation Density: 0.006%

    No Known Activations