INDEX
    Explanations

    phrases expressing opinions or beliefs

    Negative Logits
    Token          Logit
    " increa"      -2.82
    " emphat"      -2.75
    " fta"         -2.71
    " guarante"    -2.68
    " effe"        -2.67
    " squa"        -2.67
    " affor"       -2.63
    " desir"       -2.62
    " mef"         -2.61
    " ftu"         -2.61
    Positive Logits
    Token          Logit
    " I"           1.22
    " if"          1.05
    " We"          1.01
    " we"          1.00
    " If"          0.96
    "."            0.96
    "<eos>"        0.94
    "if"           0.94
    "I"            0.94
    " ["           0.93
    Activation Density 0.101%

    No Known Activations