INDEX
    Explanations

    references to falsehoods or deception

    instances of the word "lie" in various contexts

    Negative Logits
    "obs"        -0.75
    "alg"        -0.74
    "iles"       -0.71
    "ugal"       -0.71
    "ittens"     -0.70
    " liking"    -0.70
    "ourning"    -0.68
    "aud"        -0.68
    "asted"      -0.68
    "ilation"    -0.67
    Positive Logits
    " lie"       1.15
    " Lie"       1.02
    "lie"        0.93
    "Lie"        0.88
    "uten"       0.86
    " lies"      0.83
    "utenant"    0.83
    " lied"      0.81
    " theoret"   0.81
    "ously"      0.80
    Act Density 0.008%

    No Known Activations