    Negative Logits
    Token               Logit
    " romance"          -0.08
    " QLabel"           -0.07
    " Men"              -0.07
    " catastrophe"      -0.07
    " Meld"             -0.07
    " continue"         -0.07
    " Literal"          -0.07
    " gratification"    -0.07
    " harmed"           -0.07
    "Directed"          -0.07
    Positive Logits
    Token               Logit
    " banget"           0.10
    " yoğun"            0.10
    " vigorous"         0.10
    (blank)             0.09
    " unpack"           0.09
    " rampant"          0.09
    "Packed"            0.09
    "unbind"            0.09
    " upfront"          0.09
    "-packed"           0.08
    Activation Density: 0.004%

    No Known Activations