INDEX
    Explanations

    Mentions of toxicity and “toxic behavior,” especially in moderation or refusal statements.

    Negative Logits
     часом
    -0.09
    běh
    -0.07
     друга
    -0.07
    ++;↵↵
    -0.07
     evenings
    -0.07
     evening
    -0.07
     Clothing
    -0.06
     lors
    -0.06
     آلة
    -0.06
     ринку
    -0.06
    Positive Logits
     пост
    0.06
     difficile
    0.06
     Schwe
    0.06
    0.06
     Feinstein
    0.06
    +'_
    0.06
     Contract
    0.06
    ASC
    0.06
    =str
    0.06
    (float
    0.05
    Act Density 0.009%

    No Known Activations