    Explanations

    biased/unbiased

    Negative Logits
    Token               Logit
    " atrocities"       -0.08
    " Mass"             -0.08
    " cater"            -0.08
    " bott"             -0.07
    " authoritarian"    -0.07
    " Militar"          -0.07
    "्ख"                -0.07
    " foul"             -0.07
    " taht"             -0.07
    "-os"               -0.07
    Positive Logits
    Token               Logit
    "_receiver"         0.08
    "_exchange"         0.08
    " undercover"       0.08
    "تق"                0.08
    " تق"               0.08
    " récupérer"        0.07
    " 대신"             0.07
    " السفر"            0.07
    " ingen"            0.07
    " Nuna"             0.07
    Activation Density: 0.001%

    No Known Activations