INDEX
    Explanations
    NEGATIVE LOGITS
    Token           Logit
    " Again"        -0.07
    "Actual"        -0.07
    "Restaurant"    -0.07
    " контак"       -0.07
    "ek"            -0.07
    "wait"          -0.06
    " mourn"        -0.06
    " Surprise"     -0.06
    " Film"         -0.06
    " Couch"        -0.06
    POSITIVE LOGITS
    Token           Logit
    "to"            0.08
    "-to"           0.08
    (blank)         0.07
    "-To"           0.07
    "τώ"            0.06
    "تماع"          0.06
    " reasoned"     0.06
    ".enemy"        0.06
    "=wx"           0.06
    "-two"          0.06
    Activation Density: 0.002%

    No Known Activations