    Explanations

    negative/harmful content

    Negative Logits
    (token not captured)   -0.07
    " влади"               -0.07
    " Giá"                 -0.06
    " decrease"            -0.06
    "Arg"                  -0.06
    " {-"                  -0.06
    " chocol"              -0.06
    "-ms"                  -0.06
    "-exec"                -0.06
    "าระ"                  -0.06
    Positive Logits
    "Classifier"           0.07
    " фут"                 0.07
    " उसन"                 0.06
    "ترة"                  0.06
    " unsafe"              0.06
    " advertised"          0.06
    " cum"                 0.06
    " paradise"            0.06
    " Somali"              0.06
    " collusion"           0.06
    Activation Density: 0.008%

    No Known Activations