INDEX
    Explanations
    Negative Logits
    Token                 Logit
    " térm"               -0.63
    " shown"              -0.62
    " Shown"              -0.58
    "umumkan"             -0.55
    " sœurs"              -0.52
    " refiere"            -0.52
    " genoux"             -0.52
    "ruptedException"     -0.51
    " prochaines"         -0.51
    " plais"              -0.51
    Positive Logits
    Token                 Logit
    " that"               0.76
    " EconPapers"         0.60
    "InjectAttribute"     0.60
    "+:+"                 0.60
    " me"                 0.59
    " noDo"               0.58
    "tovers"              0.56
    "();)"                0.55
    " how"                0.54
    " us"                 0.54
    Activation Density: 0.031%

    No Known Activations