INDEX
    Explanations

    examples given as evidence or explanation

    phrases that introduce examples or instances

    NEGATIVE LOGITS
    "ãĤ©"        -0.71
    "jug"        -0.71
    "etheless"   -0.69
    "chant"      -0.66
    "è¦ļéĨĴ"     -0.63
    "livion"     -0.62
    "Ŀ"          -0.59
    " enthusi"   -0.58
    "aution"     -0.56
    "ãĤ¬"        -0.55
    POSITIVE LOGITS
    " example"    1.78
    " instance"   1.63
    "example"     1.31
    "say"         1.24
    "Example"     1.13
    " Example"    1.11
    " for"        1.02
    " examples"   1.01
    "for"         0.99
    " Examples"   0.99
    Act Density 0.423%

    No Known Activations