    Explanations

    phrases related to harm or injury

    Negative Logits
    ancel
    -0.15
    anten
    -0.14
    žen
    -0.14
    رد
    -0.14
    łu
    -0.13
    cess
    -0.13
    Pix
    -0.13
    važ
    -0.13
     which
    -0.13
    ダイ
    -0.13
    Positive Logits
     والتي
    0.17
    )の
    0.16
    :;↵
    0.15
    )的
    0.15
    @js
    0.15
    lew
    0.15
     sino
    0.15
    ,以及
    0.15
    ])->
    0.14
    ')['
    0.14
    Act Density 1.257%

    No Known Activations