INDEX
    Explanations
    Negative Logits
    dw          -0.15
    ói          -0.15
     Ard        -0.15
    endi        -0.14
    нос         -0.14
    ensed       -0.14
     UIB        -0.14
    ToProps     -0.14
     Burgess    -0.14
    nes         -0.14
    Positive Logits
    ven         0.30
    venir       0.29
    ther        0.29
    visející    0.23
    py          0.23
    ps          0.23
    red         0.22
    ff          0.22
    ped         0.22
    vern        0.21
    Activation Density: 0.005%

    No Known Activations