    Negative Logits
    Token             Logit
    "Contribution"    -0.09
    " Contribution"   -0.09
    "🏼"               -0.08
    " Chow"           -0.08
    "🏻"               -0.08
    " contribution"   -0.07
    " License"        -0.07
    " Label"          -0.07
    " Ay"             -0.07
    " Guz"            -0.07
    Positive Logits
    Token             Logit
    " зап"            0.08
                      0.08
    " ill"            0.08
    "hle"             0.08
    "Enumer"          0.08
    " threatening"    0.07
    " minister"       0.07
    " الط"            0.07
    " entertaining"   0.07
    " PROF"           0.07
    Activation Density: 0.000%

    No Known Activations