INDEX
    Explanations

    terms related to dishonesty and falsehoods

    Negative Logits
    Token             Logit
    "mise"            -0.17
    "ric"             -0.17
    "shal"            -0.17
    "ialized"         -0.14
    "(att"            -0.14
    ".generated"      -0.14
    "mor"             -0.14
    "scaling"         -0.14
    "gaard"           -0.14
    "ello"            -0.13
    Positive Logits
    Token             Logit
    "/false"          0.24
    "ushima"          0.16
    "ocrat"           0.16
    "fulness"         0.16
    " about"          0.16
    "HostException"   0.15
    "urous"           0.15
    "iveness"         0.15
    "ulence"          0.15
    "itious"          0.14
    Activation Density: 0.058%

    No Known Activations