    Explanations
    Negative Logits
    'herself'         -0.76
    'our'             -0.66
    ' ourselves'      -0.65
    'his'             -0.62
    'OUR'             -0.61
    ' Your'           -0.60
    ' hans'           -0.60
    ' herself'        -0.59
    ' mich'           -0.59
    'himself'         -0.59
    Positive Logits
    ' autorytatywna'   1.15
    ' ujednoznacz'     1.13
    'ValueStyle'       1.10
    'Datuak'           1.06
    ' EconPapers'      1.04
    '帖最后由'          1.01
    ' esternos'        0.99
    'tvguidetime'      0.98
    ' @"/'             0.96
    '\{\\'             0.95
    Activation Density: 0.761%

    No Known Activations