    Explanations

    phrases indicating causality or logical reasoning

    instances of the word "makes."

    Negative Logits
    Token        Logit
    "rist"       -0.56
    "uish"       -0.52
    "iped"       -0.51
    "phy"        -0.50
    " Agency"    -0.50
    " Thirty"    -0.50
    "orne"       -0.49
    "imen"       -0.49
    "vez"        -0.49
    "wana"       -0.49
    Positive Logits
    Token               Logit
    " makes"            2.83
    "makes"             2.32
    " Makes"            2.16
    " gives"            1.93
    " creates"          1.90
    " helps"            1.82
    " distinguishes"    1.81
    " lends"            1.76
    " brings"           1.75
    " proves"           1.73
    Activation Density: 0.026%

    No Known Activations