    NEGATIVE LOGITS
    Token             Logit
    "ULK"             -0.06
    "_CATEGORY"       -0.06
    "ับสน"            -0.06
    "825"             -0.06
    "(manager"        -0.06
    "english"         -0.06
    " assortment"     -0.06
    (not shown)       -0.06
    " English"        -0.06
    " Gus"            -0.06
    POSITIVE LOGITS
    Token             Logit
    " violate"        0.13
    " violated"       0.12
    " violates"       0.11
    " violations"     0.11
    " violating"      0.11
    " violation"      0.11
    (not shown)       0.08
    "_partial"        0.08
    "υγ"              0.08
    "viol"            0.08
    Activation Density: 0.013%

    No Known Activations