INDEX
    Negative Logits
    Token           Logit
    " women"        -0.08
    " judges"       -0.08
    " Dust"         -0.07
    " gunman"       -0.07
    " Muslims"      -0.07
    "-war"          -0.07
    " Great"        -0.07
    " dialogue"     -0.07
    " marsh"        -0.06
    " expended"     -0.06
    Positive Logits
    Token           Logit
    " свои"         0.10
    " свой"         0.08
    " сво"          0.08
    " 위해서"        0.08
    " अपन"          0.07
    "ична"          0.07
    "ographics"     0.07
    " своих"        0.07
    "自己"           0.07
    " своє"         0.07
    Activation Density: 0.009%

    No Known Activations