    Explanations

    phrases indicating "in other words."

    Negative Logits
    Token            Logit
    "apego"          -0.67
    "yip"            -0.66
    "atism"          -0.64
    " overcame"      -0.61
    "avorite"        -0.59
    " outweigh"      -0.58
    "icides"         -0.58
    "Always"         -0.57
    "iste"           -0.56
    "achelor"        -0.56
    Positive Logits
    Token            Logit
    "words"           1.11
    "worldly"         1.06
    " words"          1.04
    " respects"       0.94
    "wise"            0.86
    " contexts"       0.85
    "word"            0.83
    " instances"      0.80
    " circumstances"  0.78
    " areas"          0.77
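    Lists like the two above rank tokens by how strongly the feature pushes their output logits up or down. The sketch below shows one common way such rankings are produced, assuming the logit effect of a token is the dot product of the feature's decoder direction with that token's unembedding vector. The function names, the toy 3-dimensional model, and all numbers are hypothetical illustrations, not this dashboard's actual code or data.

```python
# Hypothetical sketch: rank tokens by a feature's logit effect.
# Assumption: effect(token) = dot(decoder_direction, unembedding[token]).

def logit_effects(decoder_dir, unembed, vocab):
    """Return {token: effect}, the dot product of the feature's decoder
    direction with each token's unembedding vector."""
    return {
        tok: sum(d * u for d, u in zip(decoder_dir, unembed[tok]))
        for tok in vocab
    }

def top_k(effects, k, positive=True):
    """Top-k (token, effect) pairs: largest first if positive,
    most negative first otherwise."""
    ranked = sorted(effects.items(), key=lambda kv: kv[1], reverse=positive)
    return ranked[:k]

# Toy model: 3 dimensions, 4-token vocabulary, made-up values.
decoder_dir = [1.0, 0.5, -0.5]
unembed = {
    " words": [1.0, 0.0, 0.0],      # effect 1.0
    " contexts": [0.5, 0.5, 0.0],   # effect 0.75
    " overcame": [-0.5, 0.0, 0.5],  # effect -0.75
    "Always": [0.0, -1.0, 0.0],     # effect -0.5
}
effects = logit_effects(decoder_dir, unembed, list(unembed))
print(top_k(effects, 2, positive=True))   # [(' words', 1.0), (' contexts', 0.75)]
print(top_k(effects, 2, positive=False))  # [(' overcame', -0.75), ('Always', -0.5)]
```

    Tokens are kept as exact strings, leading space included, because " words" and "words" are distinct vocabulary entries.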
    Activation Density: 0.016%
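    An activation density of 0.016% means the feature fires on roughly 1 token in 6,250. The sketch below assumes density is the fraction of tokens whose activation exceeds zero; the function name and threshold parameter are hypothetical, and the dashboard may use a different threshold or sample.

```python
# Hypothetical sketch: activation density as the fraction of tokens
# whose activation is above a threshold (assumed to be 0 here).

def activation_density(activations, threshold=0.0):
    """Fraction of tokens with activation > threshold (0.0 if empty)."""
    if not activations:
        return 0.0
    return sum(1 for a in activations if a > threshold) / len(activations)

# One firing in 6,250 tokens corresponds to 0.016% density.
acts = [0.0] * 6250
acts[100] = 2.3
print(f"{activation_density(acts):.3%}")  # 0.016%
```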

    No Known Activations