INDEX
    NEGATIVE LOGITS
    "Cit"             -0.07
    " ctype"          -0.07
    " Restoration"    -0.07
                      -0.06
    " Naruto"         -0.06
    " хочу"           -0.06
    " govern"         -0.06
    " rationale"      -0.06
    "ate"             -0.06
    "\tMat"           -0.06
    POSITIVE LOGITS
    " theorem"         0.07
    "uitar"            0.06
    " subt"            0.06
    " vál"             0.06
    " poster"          0.06
    " skl"             0.06
    " by"              0.06
    " oleh"            0.06
    "北京"             0.06
    " instancia"       0.06
    Activation Density: 0.029%

    No Known Activations