    Explanations

    refusal to generate harmful content

    Negative Logits
    Token              Value
    " ابن"             0.97
    " за"              0.96
    "ishu"             0.94
    " на"              0.94
    ","                0.92
    (blank)            0.91
    " далее"           0.89
    " далі"            0.89
    " in"              0.89
    " u"               0.89
    Positive Logits
    Token              Value
    " mennesker"       1.23
    "だったら"          1.20
    "mDatas"           1.18
    " resultContent"   1.17
    "क्षर"              1.16
    "gameField"        1.15
    "multipart"        1.14
    " ObjData"         1.13
    "ষে"               1.13
    " getSize"         1.12
    Activation Density: 0.042%

    No Known Activations