INDEX
    Negative Logits
    Token               Logit
    "anh"               -0.08
    "quir"              -0.07
    " Monument"         -0.07
    " types"            -0.07
    " heterogeneous"    -0.07
    "ोजन"               -0.07
    "োজন"               -0.07
    "ền"                -0.07
    " мор"              -0.07
    " Annex"            -0.07

    Positive Logits
    Token               Logit
    " wat"              0.09
    " verbessert"       0.09
    " Beginner"         0.08
    " Improved"         0.08
    "Impro"             0.08
    " ideaal"           0.08
    " knih"             0.08
    " respectable"      0.08
    " отлично"          0.08
    " niz"              0.08

    Activation Density: 0.006%

    No Known Activations