INDEX
    Explanations

    the word "the."

    instances of the word "watch."

    Negative Logits
    thood       -0.75
    ucl         -0.74
    uti         -0.71
    ccoli       -0.69
    recy        -0.69
    eno         -0.67
    iev         -0.66
    manship     -0.65
    trl         -0.65
    aspberry    -0.65
    Positive Logits
     same       1.28
     entirety   1.15
     entire     1.15
     slightest  1.14
     latest     1.11
     smallest   1.10
     whole      1.07
    ses         0.99
     aftermath  0.98
     vast       0.98
    Act Density 0.402%

    No Known Activations