INDEX
    Explanations

    instances of emotional manipulation or coercion

    NEGATIVE LOGITS
    Token        Logit
    " Yue"       -0.15
    "antar"      -0.15
    "iti"        -0.14
    "licht"      -0.14
    "åº"         -0.14
    "pha"        -0.14
    "_kses"      -0.14
    "ãĥĥãĥĦ"     -0.13
    "xFFF"       -0.13
    "itu"        -0.13
    POSITIVE LOGITS
    Token        Logit
    "loff"       0.15
    " lẫn"       0.15
    "ype"        0.14
    "ỡ"          0.14
    " Eh"        0.14
    "nowled"     0.14
    "åĿ¡"        0.14
    "ysz"        0.13
    "rance"      0.13
    "strup"      0.12
    Act Density 0.010%

    No Known Activations
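
    The Act Density figure above is usually read as the fraction of sampled tokens on which the feature fires. The sketch below illustrates that reading only; the function name `activation_density`, the zero threshold, and the sample data are all assumptions made for illustration, not the method this page actually uses.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: a token counts as "active" when its activation is strictly
    greater than the threshold (0.0 by default).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 10,000 tokens, exactly one of which activates the feature.
acts = [0.0] * 9999 + [2.5]
density = activation_density(acts)
print(f"{density:.3%}")  # prints 0.010%
```

    Under this reading, a density of 0.010% corresponds to roughly one active token per 10,000 sampled, which would be consistent with a dashboard that has no recorded activations to display.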