INDEX
    Explanations

    dishonesty and lies

    Negative Logits
    " Cure"        -0.09
    " watch"       -0.08
    " Watch"       -0.08
    "ાન"           -0.08
    " watching"    -0.08
    " lockers"     -0.08
    "لاس"          -0.07
    " teme"        -0.07
    "Watch"        -0.07
    " casser"      -0.07
    Positive Logits
    " truthful"        0.15
    " misleading"      0.13
    " misinformation"  0.13
    " truth"           0.13
    " deceit"          0.13
    " deceptive"       0.13
    " deceive"         0.12
    " truths"          0.12
    " deception"       0.11
    " fraudulent"      0.11
    Activation Density: 0.084%

    No Known Activations