    Negative Logits
        " is"       1.34
        " \"        0.99
        "يل"        0.98
        (blank)     0.89
        " ="        0.85
        " hecho"    0.82
        (blank)     0.81
        " 首先"     0.80
        "ουμε"      0.80
        "е"         0.80
    Positive Logits
        "ig"        1.15
        "and"       1.13
        "é"         1.08
        "us"        1.03
        (blank)     1.02
        "ag"        0.95
        "im"        0.93
        (blank)     0.93
        "ut"        0.90
        "SE"        0.89
    Activation Density: 0.023%

    No Known Activations