INDEX
    Explanations

    phrases indicating emphasis or persuasion

    expressions of belief or trust

    Negative Logits
    "imilar"    -0.74
    " sidel"    -0.71
    "idian"     -0.70
    "advant"    -0.64
    "wich"      -0.64
    "ouk"       -0.61
    "ipment"    -0.60
    "erness"    -0.58
    "erto"      -0.58
    "iri"       -0.57
    Positive Logits
    " Yourself"    0.65
    " hype"        0.62
    " admit"       0.62
    "zers"         0.61
    " me"          0.61
    " expr"        0.60
    " WHEN"        0.60
    "!:"           0.59
    " deceive"     0.59
    " Twice"       0.58
    Act Density 0.102%

    No Known Activations