    Negative Logits
    Token           Logit
     heterosexual   -0.09
     vestib         -0.08
     Steam          -0.08
     '~             -0.08
     leagues        -0.08
     gay            -0.08
     runt           -0.08
     escorts        -0.08
     semif          -0.08
     "~             -0.08
    Positive Logits
    Token           Logit
     coefficients   0.11
    coeff           0.11
    Polynomial      0.11
     polynomial     0.10
     Polynomial     0.10
     Fourier        0.10
    Coe             0.09
    _coeff          0.09
     vanish         0.09
     coeff          0.09
    Activation Density: 0.034%

    No Known Activations