INDEX
    Explanations
    Negative Logits
    Token           Logit
    " while"        -1.29
    " While"        -1.10
    " ("            -1.05
    " before"       -1.04
    " when"         -1.01
    " exuberant"    -0.92
    " особли"       -0.90
    " Киє"          -0.90
    " for"          -0.88
    " at"           -0.88
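    Positive Logits
    Token           Logit
    (not shown)     1.42
    (not shown)     1.32
    (not shown)     1.18
    " impon"        1.18
    (not shown)     1.16
    " 绒"           1.14
    (not shown)     1.13
    (not shown)     1.13
    (not shown)     1.10
    (not shown)     1.09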
    Activation Density 0.001%

    No Known Activations