    Explanations

    mathematical expressions and notation, particularly those involving powers and norms

    NEGATIVE LOGITS
    Token       Logit
    "ál"        -0.73
    " Laird"    -0.62
    "le"        -0.62
    " Diana"    -0.58
    "ala"       -0.58
    "lek"       -0.58
    "ity"       -0.57
    "B"         -0.57
    " Vela"     -0.55
    "ora"       -0.54

    POSITIVE LOGITS
    Token       Logit
    ")^{"       2.46
    "})^{"      1.77
    ")^{\"      1.68
    ")^"        1.62
    "|^{"       1.56
    "]^{"       1.45
    ")^("       1.39
    ")_{"       1.35
    "})^"       1.32
    ")|^{"      1.32
    Activation Density: 0.173%
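
    The density figure above is commonly read as the fraction of sampled token positions on which the feature fires. The sketch below illustrates that reading only; it assumes "fires" means an activation strictly greater than zero, and both the helper name `activation_density` and the sample data are hypothetical, not taken from this page or from any dashboard's code.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: a feature counts as active when its activation is strictly
    greater than the threshold (0.0 by default).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 173 active positions out of 100,000 tokens.
sample = [1.0] * 173 + [0.0] * (100_000 - 173)
print(f"{activation_density(sample):.3%}")
```

    Under this reading, a density of 0.173% means the feature is active on roughly 1 in every 578 tokens, which fits a feature tied to a narrow piece of notation.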

    No Known Activations