INDEX
    Explanations
    NEGATIVE LOGITS
    " moist"         -0.07
    "_strerror"      -0.06
    " articulate"    -0.06
    " diver"         -0.06
    "freq"           -0.06
    "_corpus"        -0.06
    " Gee"           -0.06
    " Seg"           -0.06
    " pj"            -0.06
    " Corm"          -0.06
    POSITIVE LOGITS
    " watching"      0.15
    " watch"         0.14
    " Watch"         0.13
    " watched"       0.12
    "Watch"          0.11
    " WATCH"         0.11
    "watch"          0.10
    "Watching"       0.10
    "-watch"         0.09
    "_watch"         0.09
    Activation Density: 0.018%

    No Known Activations