    NEGATIVE LOGITS
    Logit   Token (quoted to preserve leading whitespace)
    -0.07   "	op"
    -0.07   "	bl"
    -0.06   "SESSION"
    -0.06   "=url"
    -0.06   "Salir"
    -0.06   "Val"
    -0.06   " Album"
    -0.06   " shining"
    -0.06   "Handler"
    -0.06   " слід"
    POSITIVE LOGITS
    Logit   Token (quoted to preserve leading whitespace)
    0.18    " dwarf"
    0.18    " Dwarf"
    0.13    "warf"
    0.12    " dwar"
    0.06    " twins"
    0.06    "важ"
    0.06    " African"
    0.06    "draft"
    0.06    "цин"
    0.06    " DW"
    Activation density: 0.001%

    No Known Activations