INDEX
    Explanations

    actions, states, or outcomes

    Negative Logits
     \  0.66
    to  0.58
    oster  0.56
     (  0.55
    uri  0.55
    ong  0.54
    ،  0.54
    ang  0.54
     נו  0.54
    ito  0.53
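    Positive Logits
    (no token shown)  0.61
    (no token shown)  0.57
    (no token shown)  0.57
    ዛት  0.52
    ام  0.50
    (no token shown)  0.50
    (no token shown)  0.48
    ι  0.48
    (no token shown)  0.48
    ла  0.47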
    Act Density 0.231%

    No Known Activations