INDEX
    Explanations

    references to individuals and their roles or actions within specific contexts

    Negative Logits
    Token              Logit
    " ONLY"            -0.64
    " honestly"        -0.64
    " truly"           -0.64
    "whatever"         -0.62
    " finally"         -0.62
    " nevertheless"    -0.62
    " nonetheless"     -0.61
    "anything"         -0.60
    "aten"             -0.60
    "only"             -0.59
    Positive Logits
    Token              Logit
    "ypes"             0.80
    "ebted"            0.72
    "sie"              0.71
    "racted"           0.70
    " Cosponsors"      0.70
    " GOODMAN"         0.69
    "agonists"         0.67
    "urbed"            0.65
    "iac"              0.65
    "ãĤ©"              0.64
    Activation Density: 0.217%

    No Known Activations