    Negative Logits
     spion          -0.98
     frau           -0.96
     underpin       -0.93
     psychiat       -0.93
     Márquez        -0.92
     sezonu         -0.91
     horloge        -0.91
     bombar         -0.91
     [token missing] -0.90
     [token missing] -0.90
    Positive Logits
     out            1.76
     up             1.67
     through        1.63
     it             1.52
     with           1.49
     on             1.47
     from           1.29
     backwards      1.23
     Working        1.23
     things         1.20
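The two lists above rank vocabulary tokens by how strongly this feature pushes their logits up or down. A common way for dashboards like this to compute such lists (assumed here, not confirmed by this page) is to project the feature's decoder direction through the model's unembedding matrix W_U and keep the most positive and most negative entries. The minimal pure-Python sketch below shows that idea on a toy example; `logit_effects`, the toy vocabulary, and all weights are hypothetical.

```python
def logit_effects(decoder_dir, unembed, vocab, k=3):
    """Rank tokens by the direct logit effect of a feature direction.

    Assumption: decoder_dir is the feature's decoder vector (length d_model)
    and unembed is W_U given as d_model rows of len(vocab) floats each.
    Returns (top_positive, top_negative) as lists of (token, score).
    """
    scores = [
        # Dot product of the feature direction with token j's unembedding column.
        sum(decoder_dir[i] * unembed[i][j] for i in range(len(decoder_dir)))
        for j in range(len(vocab))
    ]
    # Sort ascending so the most negative effects come first.
    ranked = sorted(zip(vocab, scores), key=lambda pair: pair[1])
    negative = ranked[:k]
    positive = ranked[::-1][:k]
    return positive, negative


# Hypothetical toy model: d_model = 2, vocabulary of four tokens.
vocab = [" out", " up", " frau", " spion"]
W_U = [
    [1.0, 1.0, -0.5, -1.0],
    [1.0, 0.5, -1.0, -0.2],
]
positive, negative = logit_effects([1.0, 0.5], W_U, vocab, k=2)
print(positive)
print(negative)
```

On a real model the same projection is usually done as one matrix-vector product over the full vocabulary; the nested loops here only keep the sketch dependency-free.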
    Activation Density: 0.025%
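Activation density is read here, as an assumption, as the fraction of sampled tokens on which the feature fires (activation above zero); 0.025% then corresponds to roughly 1 token in 4,000. The sketch below computes that fraction from a flat list of per-token activations. The function name `activation_density`, the zero threshold, and the sample data are hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose feature activation exceeds `threshold`.

    Assumption: `activations` is a flat list of per-token activations
    collected over some sample of text.
    """
    if not activations:
        raise ValueError("need at least one activation")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 8,000 tokens, of which the feature fires on 2.
acts = [0.0] * 8000
acts[10] = 1.2
acts[500] = 0.3
density = activation_density(acts)
print(f"{density:.3%}")  # prints 0.025%
```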

    No Known Activations