    Explanations

    expressions of logical contradiction, invalidity, or impossibility

    Negative Logits
      Token       Logit
      "aux"       -0.07
      "contri"    -0.07
      "istique"   -0.07
      "uby"       -0.07
      "raq"       -0.06
      "Å©"        -0.06
      "úa"        -0.06
      "ÑĢовод"    -0.06
      "ppard"     -0.06
      "شت"        -0.06
    Positive Logits
      Token       Logit
      " because"  0.08
      " for"      0.07
      "iglia"     0.07
      " whereas"  0.06
      "Marco"     0.06
      " Fav"      0.06
      " Feng"     0.06
      " besides"  0.06
      " "         0.06
      " along"    0.06
    Activation density: 0.119%
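Activation density is usually read as the share of sampled tokens on which the feature is active. A minimal sketch, assuming a 1-D array of per-token activations and a firing threshold of 0 (both assumptions; `activation_density` is a hypothetical name):

```python
import numpy as np

def activation_density(acts, threshold=0.0):
    """Percentage of tokens on which a feature is active.

    acts: 1-D array of the feature's activation on each sampled token.
    threshold: value an activation must exceed to count as active
               (0.0 is an assumption; dashboards may use another cutoff).
    """
    acts = np.asarray(acts, dtype=float)
    return 100.0 * np.count_nonzero(acts > threshold) / acts.size

# 119 active tokens out of 100,000 sampled tokens.
acts = np.zeros(100_000)
acts[:119] = 1.5
print(round(activation_density(acts), 3))  # 0.119
```

At 0.119%, the feature fires on roughly 119 of every 100,000 tokens.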

    No Known Activations