    Explanations

    references to manipulative or deceitful behaviors in relation to political or economic contexts

    Negative Logits
     通販               -0.63
    TagMode             -0.63
     tatuagens          -0.61
    queryInterface      -0.60
     staden             -0.58
     habet              -0.57
    ApiModelProperty    -0.56
     pidana             -0.56
     jambes             -0.55
     nahilalakip        -0.53
    Positive Logits
    RegressionTest      0.69
     bullshit           0.51
     ped                0.50
     propaganda         0.49
     obses              0.49
     blink              0.48
     BS                 0.48
    ziz                 0.47
    ftagPool            0.47
    paganda             0.47
    Activation density: 0.879%

    No Known Activations