    Explanations

    references to violence and its consequences

    Negative Logits
    Token             Logit
    "orne"            -0.16
    "onya"            -0.15
    "RAP"             -0.15
    "esty"            -0.15
    "autocomplete"    -0.15
    "otos"            -0.14
    "bilir"           -0.14
    "anus"            -0.14
    "_motion"         -0.14
    "ancel"           -0.14
    Positive Logits
    Token             Logit
    " violence"       0.45
    " Violence"       0.36
    " violent"        0.34
    "viol"            0.33
    "-viol"           0.31
    " resort"         0.29
    "Viol"            0.28
    " physical"       0.26
    "violent"         0.26
    " viol"           0.24
    Activation Density: 0.245%

    No Known Activations