INDEX
    Explanations

    violent content or behavior

    Negative Logits
    " blame"          1.44
    " tink"           1.41
    " been"           1.37
    (no token shown)  1.36
    (no token shown)  1.36
    " aloud"          1.34
    "-\"              1.33
    " worry"          1.33
    "been"            1.29
    "puts"            1.29
    Positive Logits
    "ität"            1.73
    "د"               1.59
    " conformément"   1.54
    "ли"              1.43
    " dagar"          1.43
    "ição"            1.42
    "יות"             1.42
    " combate"        1.41
    "er"              1.39
    "رود"             1.38
    Act Density 0.135%

    No Known Activations