INDEX
    Explanations

    publications

    Negative Logits
     pozit         -0.07
     дви           -0.07
    _rb            -0.06
     Offers        -0.06
     Supporting    -0.06
     Stephen       -0.06
     υπό           -0.06
     WORD          -0.06
    Direction      -0.06
     вред          -0.06
    Positive Logits
     Manhattan     0.08
     compagn       0.07
     hack          0.06
    hattan         0.06
    efeller        0.06
     Knicks        0.06
     Rockefeller   0.06
    )",↵           0.06
    važ            0.06
    itan           0.06
    Act Density 0.127%

    No Known Activations