INDEX
    Explanations

    words related to deception or falsehood

    Negative Logits
    Token        Logit
    "es"         -0.18
    "eval"       -0.18
    "ylvania"    -0.17
    "lle"        -0.16
    "laus"       -0.16
    "ee"         -0.15
    "little"     -0.15
    "ele"        -0.15
    "i"          -0.14
    "eko"        -0.14
    Positive Logits
    Token        Logit
    "ardy"       0.18
    "овер"       0.17
    "quete"      0.17
    "шив"        0.16
    "lover"      0.15
    "rus"        0.15
    " krat"      0.14
    "coholic"    0.14
    "نامه"       0.14
    " Fall"      0.14
    Activation Density: 0.010%

    No Known Activations