INDEX
    Explanations

    phrases indicating clarification or emphasis

    phrases structured around the concept of "being" or existence

    Negative Logits
    "yip"        -0.81
    "naires"     -0.65
    "rador"      -0.61
    " Lines"     -0.58
    "atis"       -0.57
    "wi"         -0.56
    " Bung"      -0.56
    " nap"       -0.56
    " rows"      -0.56
    "ria"        -0.55
    Positive Logits
    " honest"     1.12
    " sure"       1.11
    " blunt"      1.03
    " able"       0.99
    " frank"      0.93
    " fair"       0.85
    "heading"     0.85
    " eligible"   0.82
    " truthful"   0.81
    " careful"    0.79
    Activation Density: 0.048%

    No Known Activations