INDEX
    Explanations
    Negative Logits
        " krev"       -0.08
        " سیاست"      -0.06
        "ε"           -0.06
        "yn"          -0.06
        " Cyan"       -0.06
        "_topics"     -0.06
        "_states"     -0.06
        " cerco"      -0.06
        "ylum"        -0.06
        " yan"        -0.06
    Positive Logits
        " Arthur"     0.19
        "Arthur"      0.16
        "thur"        0.12
        "Alice"       0.09
        " arthritis"  0.09
        " Martha"     0.08
        "tar"         0.07
        " Arth"       0.07
        " Alice"      0.07
        "arth"        0.07
    Act Density: 0.005%

    No Known Activations