    Explanations

    negations or negative phrases in the text

    Negative Logits
        Token        Logit
        " latter"    -0.18
        "iaux"       -0.16
        "eless"      -0.16
        "n"          -0.15
        "z"          -0.15
        "-sided"     -0.15
        "h"          -0.15
        "d"          -0.14
        "Ñıб"        -0.14
        "874"        -0.14
    Positive Logits
        Token        Logit
        "ħn"         0.15
        "/-"         0.15
        "ÑįÑĤомÑĥ"   0.15
        " Bris"      0.14
        "rador"      0.14
        "rası"       0.13
        "ADOR"       0.13
        "prav"       0.13
        "ador"       0.13
        "ÏĥÏħ"       0.13
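The page does not say how these logit lists are produced. A common convention for SAE feature dashboards is to multiply the feature's decoder direction by the model's unembedding matrix and report the most negative and most positive entries. The pure-Python sketch below assumes exactly that (no final LayerNorm, no bias term); the function names and the toy matrices are hypothetical, not the dashboard's actual code.

```python
# Hypothetical sketch: rank vocabulary tokens by how strongly one SAE
# feature's decoder direction pushes their logits down or up.
# ASSUMPTION: logit effect = decoder_direction @ W_U, ignoring any
# final LayerNorm or bias. This is not the site's documented method.


def logit_effects(decoder_direction, unembed, vocab):
    """Return (token, score) pairs, one per vocabulary entry.

    decoder_direction: list of d_model floats (the feature's W_dec row).
    unembed: d_model rows, each a list of len(vocab) floats (W_U).
    vocab: token strings, indexed like the columns of W_U.
    """
    if len(unembed) != len(decoder_direction):
        raise ValueError("W_U must have one row per d_model dimension")
    scores = [0.0] * len(vocab)
    for d, coeff in enumerate(decoder_direction):
        row = unembed[d]
        for t in range(len(vocab)):
            scores[t] += coeff * row[t]  # accumulate the dot product per token
    return list(zip(vocab, scores))


def top_and_bottom(effects, k):
    """Return the k most negative and the k most positive (token, score) pairs."""
    ranked = sorted(effects, key=lambda pair: pair[1])
    negative = ranked[:k]        # most negative first
    positive = ranked[::-1][:k]  # most positive first
    return negative, positive


if __name__ == "__main__":
    # Toy example: d_model = 2, vocabulary of 4 tokens.
    vocab = ["a", "b", "c", "d"]
    direction = [1.0, -0.5]
    w_u = [[0.2, -0.1, 0.0, 0.3],
           [0.4, 0.2, -0.2, 0.0]]
    neg, pos = top_and_bottom(logit_effects(direction, w_u, vocab), k=2)
    print("most negative:", [tok for tok, _ in neg])
    print("most positive:", [tok for tok, _ in pos])
```

With small effects of roughly ±0.15 across the board, as in the lists above, no single token is strongly promoted or suppressed by this direction.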
    Activation density: 0.067%
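Activation density is presumably the share of sampled tokens on which the feature fires. A minimal sketch, assuming "fires" means an activation strictly above zero (the threshold and the function name are hypothetical): at 0.067%, the feature would be active on roughly 67 of every 100,000 tokens.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose feature activation exceeds `threshold`.

    ASSUMPTION: a feature "fires" when its activation is strictly above
    the threshold (0.0 by default, matching a ReLU-style SAE).
    """
    if not activations:
        raise ValueError("need at least one activation value")
    firing = sum(1 for a in activations if a > threshold)
    return firing / len(activations)


if __name__ == "__main__":
    # 67 firing tokens out of 100,000 sampled tokens.
    acts = [1.0] * 67 + [0.0] * 99_933
    print(f"{activation_density(acts):.3%}")  # prints 0.067%
```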

    No Known Activations