INDEX
    Explanations
    NEGATIVE LOGITS
     Fraud       -0.09
    fak          -0.09
     Bere        -0.08
    _traits      -0.08
     fraud       -0.08
     Synd        -0.08
    æ¬           -0.08
    intr         -0.08
     درجÙĩ       -0.08
     èŃ          -0.08
    POSITIVE LOGITS
     original    0.17
     hate        0.15
     statement   0.13
     speech      0.13
    original     0.13
    (original    0.12
     message     0.12
     Hate        0.12
     initial     0.12
     argument    0.11
    Activation Density: 0.058%

    No Known Activations