INDEX
    Explanations

    phrases that indicate honesty or truthfulness

    Negative Logits
    Token             Logit
    "inz"             -0.15
    "illa"            -0.15
    "swick"           -0.14
    "ardin"           -0.14
    " Gallagher"      -0.14
    "ange"            -0.14
    "ÙĦØ·"            -0.14
    "PURE"            -0.14
    "press"           -0.14
    " ìŀĪëĭ¤ëĬĶ"      -0.14
    Positive Logits
    Token             Logit
    "-Sah"            0.17
    "ças"             0.15
    "éal"             0.15
    "aired"           0.15
    "亮"              0.14
    "auses"           0.14
    "atorium"         0.14
    "icone"           0.14
    "odyn"            0.14
    "¢åįķ"            0.14
    Activation Density: 0.108%

    No Known Activations