INDEX
    Negative Logits
    Token             Logit
    "/down"           -0.07
    " HALF"           -0.06
    " среди"          -0.06
    " freedoms"       -0.06
    "/games"          -0.06
    "-shaped"         -0.06
    "ротив"           -0.06
    " dalla"          -0.06
    " pins"           -0.06
    " functioning"    -0.06
    Positive Logits
    Token             Logit
    " Solo"           0.08
    " solo"           0.08
    "优势"            0.07
    "وق"              0.07
    " «"              0.06
    "educ"            0.06
    "Solo"            0.06
    "лаг"             0.06
    " Composer"       0.06
    " Ferr"           0.06
    Activation Density: 0.004%

    No Known Activations