INDEX
    Explanations

    mentions of lying or related terms

    expressions related to dishonesty and deceit

    Negative Logits
    Token         Logit
    "aldi"        -0.85
    "ugal"        -0.76
    "joining"     -0.76
    "orsi"        -0.75
    "allows"      -0.74
    "FN"          -0.72
    "ains"        -0.69
    "illed"       -0.69
    "obs"         -0.69
    "ategory"     -0.69
    Positive Logits
    Token              Logit
    " detector"        1.01
    "uten"             0.94
    " detectors"       0.77
    " deceit"          0.76
    "utenant"          0.76
    " vulner"          0.74
    " liar"            0.74
    " misrepresent"    0.73
    "acies"            0.73
    " deceive"         0.72
    Activation Density: 0.021%

    No Known Activations
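
The density figure above can be read with a small sketch. This assumes the density is the fraction of sampled token positions on which the feature's activation is above zero, which is the common convention for dashboards of this kind but is not stated on this page. The function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and purely illustrative.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions where the
    feature fires at all; the source page does not define the term.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 10,000 token positions, feature fires on only 2 of them.
acts = [0.0] * 10_000
acts[17] = 3.2
acts[4242] = 0.9

density = activation_density(acts)
print(f"{density:.3%}")  # prints 0.020%
```

Under this reading, a density of 0.021% would mean the feature fires on roughly 2 of every 10,000 sampled tokens, which fits a narrow feature tied to words about lying and deceit.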