INDEX
    Explanations

    references to torture and suffering

    Negative Logits
    Token         Logit
    "UnderTest"   -0.15
    "ardon"       -0.15
    "lining"      -0.14
    "ê·Ģ"         -0.14
    "yg"          -0.14
    "ially"       -0.14
    "hiro"        -0.14
    "اØŃÛĮ"        -0.14
    "alties"      -0.13
    "ikat"        -0.13

    Positive Logits
    Token         Logit
    "ofil"        0.15
    "/plain"      0.15
    " Plain"      0.14
    "inct"        0.14
    "oenix"       0.14
    "باØŃ"         0.14
    "ANTI"        0.14
    "lixir"       0.13
    "aylight"     0.13
    " bev"        0.13

    Activation Density: 0.010%

    No Known Activations