INDEX
    NEGATIVE LOGITS
    Token           Logit
    "ster"          -0.66
    "jee"           -0.63
    "ature"         -0.62
    "por"           -0.61
    "uttering"      -0.59
    "enburg"        -0.58
    "END"           -0.57
    "lining"        -0.57
    "lich"          -0.56
    " favor"        -0.55
    POSITIVE LOGITS
    Token           Logit
    "soever"        1.36
    " happens"      1.36
    " happened"     1.34
    " transpired"   1.22
    " constitutes"  1.11
    " else"         0.94
    " happ"         0.91
    " occurs"       0.89
    " unfolds"      0.87
    " mattered"     0.86
    Activation Density: 0.098%

    No Known Activations