INDEX
    Explanations

    references to physical harm or damage

    Negative Logits
    lify        -0.17
    GI          -0.17
    enty        -0.15
    shire       -0.15
    midi        -0.15
    ?(:         -0.15
    mie         -0.15
    خاÙĨÙĩ       -0.15
    Nİ          -0.14
    anness      -0.14
    Positive Logits
     done       0.29
    害          0.22
     Done       0.22
    done        0.21
     DONE       0.21
     sustained  0.21
    Done        0.20
    -done       0.18
    aceutical   0.18
    (done       0.17
    Activation Density 0.060%

    No Known Activations