INDEX
    Explanations
    Negative Logits
    Token           Logit
    "ube"           -0.08
    " яких"         -0.07
    "UBE"           -0.07
    "αι"            -0.06
    " scary"        -0.06
    "Uber"          -0.06
    "Advertising"   -0.06
    "erge"          -0.06
    " issue"        -0.06
    " Festival"     -0.06
    Positive Logits
    Token             Logit
    "_continue"       0.07
    (token missing)   0.07
    "sage"            0.06
    (token missing)   0.06
    "_AR"             0.06
    ":not"            0.06
    "!(:"             0.06
    "_'.$"            0.06
    " radix"          0.06
    " hop"            0.06
    Activation Density: 0.003%

    No Known Activations