INDEX
    Explanations

    programmed to refuse harmful requests

    Negative Logits
    Token        Value
    "Wend"       0.47
    " нен"       0.40
    " мет"       0.40
    " Vend"      0.38
    "wend"       0.37
    "Adi"        0.37
    " السالب"    0.36
    "SK"         0.36
    "Ժ"          0.36
    "skip"       0.35
    Positive Logits
    Token            Value
    " cubes"         0.38
    "শনে"            0.37
    " Saha"          0.37
    " સુ"            0.36
    " STAR"          0.36
    " pug"           0.35
    " Keenan"        0.35
    " PANEL"         0.35
    " evaporated"    0.35
    " Kiv"           0.35
    Activation Density: 0.004%

    No Known Activations