INDEX
    Explanations

    phrases indicating truthfulness or fairness

    phrases that express honesty and truthfulness

    Negative Logits
    chairs      -0.71
    urated      -0.62
    gotten      -0.61
    into        -0.61
    boro        -0.60
    colored     -0.60
    chair       -0.59
    ende        -0.59
    worth       -0.57
    bled        -0.57
    Positive Logits
     however    1.02
     though     1.00
     tho        0.77
    adays       0.76
    nown        0.73
     there      0.72
     meanwhile  0.70
    pter        0.65
     it         0.64
     neither    0.64
    Act Density 0.143%

    No Known Activations