INDEX
    Explanations
    Negative Logits
    Token             Logit
    "Powered"         -0.07
    "ubble"           -0.07
    " béné"           -0.07
    " pretrained"     -0.07
    "ijan"            -0.07
    "pag"             -0.07
    " gt"             -0.07
    "usap"            -0.07
    " pare"           -0.07
    " pprint"         -0.07
    Positive Logits
    Token             Logit
    " disrespect"     0.15
    " наруш"          0.12
    " violates"       0.10
    " jeopard"        0.10
    " underm"         0.10
    " disrupt"        0.10
    " terhadap"       0.10
    " disrupting"     0.09
    " offend"         0.09
    " нарушения"      0.09
    Activation Density: 0.029%

    No Known Activations