INDEX
    Explanations

    words related to dehumanization and its effects

    Negative Logits
    yt
    -0.17
    ames
    -0.16
    itsu
    -0.15
    YT
    -0.15
    wei
    -0.15
    дов
    -0.14
    lian
    -0.14
    anford
    -0.14
    reme
    -0.14
    ittle
    -0.14
    Positive Logits
     facto
    0.18
    urgeon
    0.15
     rig
    0.15
     de
    0.15
    æľĭ
    0.15
     Decomp
    0.15
    æŃ
    0.14
    ัà¸ģà¸Ĺ
    0.14
    ognito
    0.14
    eam
    0.14
    Act Density 0.040%

    No Known Activations