    Explanations

    words related to disagreement or separation

    Negative Logits
    oad        -0.15
    ewe        -0.15
    -duty      -0.15
    emann      -0.14
     Amar      -0.14
    iasm       -0.14
    ستی        -0.14
    udes       -0.14
    olecules   -0.14
    703        -0.14
    Positive Logits
    eson        0.18
    hole        0.17
    ling        0.16
    abled       0.16
    ery         0.16
    ERY         0.15
     Howe       0.15
    tir         0.15
    entials     0.15
    enden       0.15
    Activation Density: 0.052%

    No Known Activations