INDEX
    Explanations

    harmful content and instructions

    Negative Logits
    Token         Logit
     provides     0.32
     aids         0.32
    專業          0.29
     aided        0.28
     Deluxe       0.28
     hopefully    0.27
     Made         0.27
    &#            0.27
     Provides     0.27
     Gaff         0.26
    Positive Logits
    Token         Logit
     полити       0.31
    defeated      0.31
    proble        0.30
     gestire      0.30
    ുമോ           0.30
    统治          0.30
     sbagli       0.30
     politique    0.30
     accusation   0.30
     kuhusu       0.30
    Activation Density: 0.001%

    No Known Activations