INDEX
    Explanations

    references to harm or damage in various contexts

    Negative Logits
    "enty"            -0.16
    "rick"            -0.16
    "lify"            -0.15
    "Nİ"              -0.15
    "klady"           -0.15
    "shire"           -0.14
    ".nz"             -0.14
    "yı"              -0.14
    "bject"           -0.14
    "../../../../"    -0.14
    Positive Logits
    " sustained"      0.34
    " done"           0.33
    " Done"           0.28
    " DONE"           0.27
    "done"            0.27
    "Done"            0.26
    " sustain"        0.24
    "-done"           0.24
    "\tdone"          0.22
    " sust"           0.21
    Act Density 0.080%

    No Known Activations