INDEX
    Explanations
    Negative Logits
    Token        Logit
    " da"        0.77
    " su"        0.65
    " s"         0.65
    " ca"        0.61
    " pie"       0.60
    " explain"   0.60
    " target"    0.59
    " w"         0.57
    " ch"        0.57
    " pant"      0.56
    Positive Logits
    Token             Logit
    "<unused1797>"    1.23
    "<unused1999>"    1.16
    "<unused1857>"    1.10
    "<unused1037>"    1.06
    "<unused1919>"    1.02
    "<unused1886>"    0.99
    "<unused1726>"    0.97
                      0.96
    "𝚇"               0.96
    "<unused1710>"    0.95
    Act Density 0.132%

    No Known Activations