INDEX
    Explanations

    actions involving physical violence or aggression

    Negative Logits
    "ặt"        -0.15
    " unh"      -0.15
    " unb"      -0.15
    "igne"      -0.15
    "ož"        -0.14
    "ç©"        -0.14
    "atik"      -0.14
    " Bread"    -0.14
    " Bever"    -0.14
    " bread"    -0.13
    Positive Logits
    "holm"      0.16
    "OE"        0.14
    "uty"       0.14
    "oste"      0.14
    "ologie"    0.14
    "ëį"        0.14
    "uten"      0.14
    "éĻ£"       0.14
    "zel"       0.14
    "onne"      0.14
    Activation Density: 0.178%

    No Known Activations