    Explanations

    phrases related to deceit or untruthfulness

    NEGATIVE LOGITS
    afort        -0.15
    VL           -0.14
     Rewards     -0.14
    λά           -0.14
    zn           -0.14
    /modules     -0.14
     kız         -0.14
    zew          -0.14
    zes          -0.14
     watchers    -0.14
    POSITIVE LOGITS
    osy          0.16
    ypi          0.15
    ernote       0.15
    anza         0.14
    :host        0.14
    /block       0.14
    овар         0.14
    iyan         0.14
    ač           0.14
    adeon        0.14
    Activation Density: 0.054%

    No Known Activations