INDEX
    Explanations
    Negative Logits
    "accounts"     -0.07
    " CST"         -0.07
    " Cait"        -0.07
    " капіт"       -0.07
    " numar"       -0.06
    " cela"        -0.06
    " персп"       -0.06
    "ît"           -0.06
    " teplot"      -0.06
                   -0.06
    POSITIVE LOGITS
    " wrong"         0.17
    " Wrong"         0.12
    " WRONG"         0.11
    "wrong"          0.11
    "Wrong"          0.10
    " Ways"          0.08
    " wrongdoing"    0.08
    " wrongly"       0.08
    " right"         0.08
    " What"          0.07
    Activation Density: 0.011%

    No Known Activations