    Explanations

    phrases related to lying or deception

    Negative Logits
    Token             Logit
    "ugal"            -0.82
    "sylv"            -0.74
    "ilation"         -0.69
    " specificity"    -0.66
    "iles"            -0.65
    "runs"            -0.65
    "night"           -0.63
    "arthy"           -0.62
    "oso"             -0.61
    " Signature"      -0.59

    Positive Logits
    Token             Logit
    " dormant"         0.88
    " awake"           0.87
    "uten"             0.81
    "yss"              0.78
    "pard"             0.77
    "bling"            0.77
    "utenant"          0.74
    " detector"        0.72
    "lie"              0.71
    " asleep"          0.70
    Activation Density: 0.017%

    No Known Activations