    Explanations

    negations and dismissive phrases

    Negative Logits
    Token        Logit
    "roker"      -0.15
    " éº"        -0.14
    "ÑĤÑı"       -0.14
    "кап"        -0.14
    " Deng"      -0.13
    "رÙĬب"       -0.13
    "thro"       -0.13
    " Fury"      -0.13
    " Prompt"    -0.13
    " modulo"    -0.13
    Positive Logits
    Token        Logit
    "ucz"        0.18
    "yw"         0.16
    "abus"       0.15
    "yg"         0.14
    "lijah"      0.14
    "abb"        0.14
    " domest"    0.14
    "-hit"       0.14
    "wcs"        0.14
    " Clem"      0.14
    Act Density 0.021%

    No Known Activations