INDEX
    Explanations

    specific formatting and structural elements in documents

    numeric values and numerical expressions in code or mathematical notation

    Negative Logits
     queſta          -0.84
     ddelwed         -0.78
     $_(             -0.78
     betweenstory    -0.75
     increí          -0.73
     dieß            -0.73
     Verſ            -0.73
     ſind            -0.72
     Italijani       -0.70
     dieſes          -0.69
    Positive Logits
    </               0.43
    <h2>             0.43
    <b>              0.42
    [toxicity=0]     0.38
    ag               0.37
    <h1>             0.37
    به               0.36
    ></              0.36
     </              0.36
     appunto         0.36
    Act Density 1.825%

    No Known Activations