INDEX
    Explanations

    specific numeric references or identifiers

    Negative Logits
     للاسماء
    -1.09
     leaſt
    -0.97
    Дереккөздер
    -0.96
     ―――――
    -0.94
     itſelf
    -0.91
    ſelf
    -0.91
     myſelf
    -0.90
    featureID
    -0.90
     himſelf
    -0.88
    ſelves
    -0.88
    Positive Logits
    </strong>
    0.76
    </b>
    0.62
    "
    0.56
    0.55
    0.54
     ‘
    0.52
     '
    0.52
    0.50
    0.50
    !
    0.49
    Act Density 0.058%

    No Known Activations