INDEX
    Explanations

    phrases describing hidden agendas or deceptive actions

    instances of deceptive language or misrepresentation

    Negative Logits
    Token         Logit
    " Sob"        -0.66
    "chenko"      -0.62
    "gra"         -0.61
    "bent"        -0.60
    "loads"       -0.60
    " Survey"     -0.59
    "ozy"         -0.59
    "cedes"       -0.58
    "below"       -0.58
    " polled"     -0.58
    Positive Logits
    Token            Logit
    " innocuous"     0.92
    " invincible"    0.79
    " innocence"     0.78
    "OPA"            0.76
    " UL"            0.74
    " benign"        0.72
    " harmless"      0.66
    " rud"           0.63
    " neutrality"    0.62
    "regn"           0.61
    Activation Density: 0.741%

    No Known Activations