    NEGATIVE LOGITS
    "=train"          -0.07
    (whitespace)      -0.07
    "ститут"          -0.06
    " disruptive"     -0.06
    "190"             -0.06
    " enfer"          -0.06
    " sabot"          -0.06
    "(report"         -0.06
                      -0.06
    ".Not"            -0.06
    POSITIVE LOGITS
    " Alex"           0.10
    "Alex"            0.09
    " Alexander"      0.08
    "Alexander"       0.08
    " Alexis"         0.08
    " alex"           0.08
    "CLASS"           0.07
    "LEX"             0.07
    "кус"             0.07
    " tags"           0.07
    Activation Density: 0.007%

    No Known Activations