    Explanations

    terms related to physical harm and damage

    Negative Logits
    enty            -0.17
    fty             -0.17
    GI              -0.17
    anness          -0.16
    izens           -0.15
    lify            -0.15
    roc             -0.15
    ../../../../    -0.15
    .nz             -0.15
    init            -0.15
    Positive Logits
    害              0.21
     done           0.20
    proof           0.17
     sustained      0.17
    aceutical       0.17
    lessly          0.16
    /dist           0.16
    full            0.15
    fully           0.15
    lijke           0.15
    Activation Density: 0.058%

    No Known Activations