INDEX
    Explanations

    "implies"/"indicates" followed by a consequence or explanation

    Negative Logits
    " in"          0.84
    " of"          0.70
    " at"          0.64
    " was"         0.64
    " is"          0.63
    "of"           0.63
    " i"           0.60
    "are"          0.57
    " wenn"        0.55
    " are"         0.55

    Positive Logits
    "N"            0.50
    "ת"            0.46
    "B"            0.41
    "ין"           0.41
    "T"            0.39
    "Peach"        0.38
    "Z"            0.37
    " nonconvex"   0.37
    "다고"         0.36
                   0.36
    Act Density 5.908%

    No Known Activations