INDEX
    Explanations

    contrasts between opposing viewpoints or groups

    Negative Logits
    "edin"         -0.16
    "unas"         -0.15
    " Î"           -0.15
    "RIORITY"      -0.15
    "ola"          -0.14
    "IENTATION"    -0.14
    "chaft"        -0.14
    "idd"          -0.14
    "彩"           -0.14
    " Cornel"      -0.14
    Positive Logits
    "905"          0.15
    "isha"         0.15
    " decl"        0.14
    "aggio"        0.14
    "μÏĮ"          0.14
    "ろ"           0.14
    " fur"         0.14
    " مباش"        0.13
    " classical"   0.13
    "atters"       0.13
    Activation Density: 0.158%
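    The density figure is easiest to read as a rate: 0.158% means the feature fires on roughly 1.58 of every 1,000 tokens. The sketch below assumes the common definition (share of tokens with a strictly positive activation); the page itself does not state how it computes the number, and the activation data here is hypothetical.

```python
# Minimal sketch, assuming activation density = percentage of tokens on
# which the feature's activation is strictly positive.
def activation_density(activations: list[float]) -> float:
    """Return the percentage of tokens on which the feature fires."""
    if not activations:
        raise ValueError("need at least one activation")
    fired = sum(1 for a in activations if a > 0.0)
    return 100.0 * fired / len(activations)


# Hypothetical data: 100,000 tokens, 158 of which have a positive activation.
acts = [0.0] * 100_000
for i in range(158):
    acts[i * 600] = 1.0  # arbitrary positive value at spread-out positions

print(f"{activation_density(acts):.3f}%")  # 0.158%
```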

    No Known Activations