INDEX
    Explanations
    Negative Logits
    "strong"          -0.08
    " Thing"          -0.07
    " explanation"    -0.07
    " manipulation"   -0.07
    " сама"           -0.07
    " better"         -0.06
    " trick"          -0.06
    " záznam"         -0.06
    " yên"            -0.06
    " served"         -0.06
    Positive Logits
    "الد"             0.06
    "yr"              0.06
    "ไทย"             0.06
    " owed"           0.06
    " Franç"          0.06
    "nant"            0.06
    ".makeText"       0.06
    "кав"             0.06
    "さんは"          0.06
    " Jacksonville"   0.06
    Activation Density: 0.028%

    No Known Activations