INDEX
    Explanations

    concepts related to deception and falsehoods

    Negative Logits
    Token               Logit
    ".scalablytyped"    -0.24
    " both"             -0.15
    " Both"             -0.14
    "raquo"             -0.14
    "Both"              -0.14
    "both"              -0.14
    "WEEN"              -0.13
    " BOTH"             -0.13
    "bole"              -0.12
    " ;č↵"              -0.12
    Positive Logits
    Token               Logit
    " XYZ"              0.42
    " xyz"              0.35
    "XYZ"               0.33
    " X"                0.29
    "xyz"               0.26
    "æŁIJ"              0.26
    " say"              0.25
    "say"               0.25
    " tomorrow"         0.24
    " ABC"              0.24
    Activation Density: 0.782%

    No Known Activations