    Negative Logits
    Logit   Token
    -0.08   "Done"
    -0.07   " 이용"
    -0.07   " sneak"
    -0.07   " Lose"
    -0.07   " spaceship"
    -0.07   " startups"
    -0.06   "כוונת"
    -0.06   "Coach"
    -0.06   "oda"
    -0.06   "(run"
    Positive Logits
    Logit   Token
    0.07    "_div"
    0.07    "ք"
    0.06    "uating"
    0.06    " hẹ"
    0.06    "chers"
    0.06    " digging"
    0.06    "ulkan"
    0.06    "ために"
    0.06    " dialect"
    0.06    "ql"
    Activation Density: 0.002%

    No Known Activations