INDEX
    Explanations

    words related to deception and falsehoods

    Negative Logits
    nels              -0.17
    mente             -0.16
    core              -0.16
    rik               -0.15
    .scalablytyped    -0.15
    rost              -0.15
    bos               -0.15
    tle               -0.15
     wholly           -0.15
    ially             -0.15
    Positive Logits
    ÌĪ                0.20
    keepers           0.18
    readcr            0.17
    theast            0.17
    xygen             0.17
    yssey             0.17
    ys                0.16
    thing             0.16
    otros             0.15
    ãĤ©               0.15
    Activation Density: 0.559%

    No Known Activations