INDEX
    Explanations

    references to violence and harmful ideologies, particularly relating to genocide and oppression

    Negative Logits
    less        -0.06
    ナ          -0.06
    746         -0.05
    yth         -0.05
    628         -0.05
     sparing    -0.05
    a           -0.05
    shop        -0.05
    -less       -0.05
    aint        -0.05
    Positive Logits
    avou        0.09
    tuk         0.09
    apesh       0.08
    hoot        0.08
    ersive      0.08
    eryl        0.08
    знача       0.08
    -pills      0.07
    นค          0.07
    bard        0.07
    Activation Density 0.072%

    No Known Activations