INDEX
    Explanations

    simple to understand and implement

    Negative Logits
     tedes      0.45
     librarian  0.43
     venue      0.42
     barbers    0.41
     scholar    0.41
     recruit    0.41
     chickpeas  0.40
     khe        0.40
     slay       0.40
     stim       0.40
    Positive Logits
     свое       0.54
    ewater      0.42
     своем      0.42
     Norweg     0.42
    avanje      0.41
    ний         0.41
                0.41
                0.40
    elu         0.40
     своей      0.39
    Act Density 0.004%

    No Known Activations