    Explanations

    references to historical violence and anti-Semitic events

    Negative Logits
    åĩĿ         -0.16
    quo         -0.15
    ptive       -0.14
    ixel        -0.14
    ocol        -0.14
    reon        -0.14
     rall       -0.14
    ruba        -0.14
    uito        -0.14
    Disposable  -0.14
    Positive Logits
    elib        0.18
     ill        0.15
     Gle        0.15
     McCabe     0.15
    æĢ§         0.14
    asca        0.14
    hle         0.14
     tongue     0.14
    ählen       0.14
    adge        0.14
    Act Density 0.128%

    No Known Activations