INDEX
    Explanations

    references to violence or attacks

    Negative Logits
    "lass"         -0.16
    "лÑıв"         -0.16
    "legg"         -0.15
    " entr"        -0.15
    " Islam"       -0.14
    " Dans"        -0.14
    "Dans"         -0.14
    " RECEIVER"    -0.14
    "_mgr"         -0.14
    "leston"       -0.13
    Positive Logits
    "tps"          0.16
    " rencont"     0.15
    " ground"      0.15
    "oteca"        0.15
    "orts"         0.15
    "ż"            0.15
    " Bi"          0.14
    "éľ"           0.14
    "assen"        0.14
    "θο"           0.14
    Activation Density 0.027%

    No Known Activations