    Explanations

    punctuation marks and formatting characters in the document

    Negative Logits
    " ly"             -0.74
    "[toxicity=0]"    -0.62
    " lys"            -0.57
    "ing"             -0.57
    "lys"             -0.57
    " Kim"            -0.55
    "tsam"            -0.55
    " ban"            -0.55
    "ValueStyle"      -0.54
    "hyd"             -0.53
    Positive Logits
    " فريبيس"            1.06
    "DoubleQuotes"       1.03
    "principalColumn"    0.97
    "قایناق‌لار"          0.94
    "例句"               0.92
    ")。"                0.92
    "}))"                0.85
    ")、"                0.83
    (token not shown)    0.83
    " bekym"             0.82
    Act Density 0.116%

    No Known Activations