INDEX
    Explanations

    words associated with deceptive or misleading behavior

    Negative Logits
     Lal      -0.18
     Loren    -0.16
     Lilly    -0.16
    æ½        -0.15
    onden     -0.15
    ÄŁine     -0.14
     Liquid   -0.14
     Lair     -0.14
    нев       -0.14
     Lowe     -0.14

    Positive Logits
    les       0.56
    led       0.52
    ling      0.45
    LES       0.44
    ler       0.43
    lers      0.38
    lesh      0.38
    ledo      0.33
    le        0.31
    LED       0.30

    Act Density 0.069%

    No Known Activations