INDEX
    Explanations

    references to physical harm or injury

    Negative Logits
    roc        -0.19
    ander      -0.18
    ÙĨØ´       -0.15
    irie       -0.15
    rose       -0.15
    intptr     -0.14
    оÑģÑĮ       -0.14
    ÅĻ         -0.14
    หมาย       -0.14
     dõi       -0.14
    Positive Logits
    害         0.18
    ollen      0.16
    hur        0.15
    år         0.14
    alink      0.14
    물ìĿĦ      0.14
    dictions   0.14
    fal        0.14
    idders     0.14
    eut        0.14
    Activation Density: 0.067%

    No Known Activations