INDEX
    Explanations

    negation expressions or indicators of falsehood in logical statements

    Negative Logits
    " שוליים"          -0.68
    " zwiſchen"        -0.68
    " queſta"          -0.67
    " メンテナ"         -0.67
    "Personendaten"    -0.66
    "<unused28>"       -0.65
    "<unused47>"       -0.65
    "<unused23>"       -0.65
    "[@BOS@]"          -0.65
    "<unused3>"        -0.65
    Positive Logits
    "=!"               0.80
    " !"               0.77
    "(!"               0.71
    " (!"              0.68
    "!"                0.65
    " ((!"             0.60
    "{!"               0.58
                       0.53
    " {!"              0.52
    "[!"               0.51
    Activation Density: 0.007%

    No Known Activations