INDEX
    Explanations

    mentions of severe consequences or impactful actions

    Negative Logits
        "ice"       -0.15
        "ège"       -0.15
        " fo"       -0.14
        "-in"       -0.14
        " Pap"      -0.13
        " Fund"     -0.13
        "icious"    -0.13
        " fre"      -0.13
        "archy"     -0.13
        " å°"       -0.13   (incomplete UTF-8 byte sequence)
    Positive Logits
        "PCP"        0.18
        "ôt"         0.17
        "ゅ"         0.16   (raw byte-level token: ãĤħ)
        "/WebAPI"    0.16
        "sil"        0.15
        "극"         0.15   (raw byte-level token: ê·¹)
        " Morales"   0.15
        "wert"       0.15
        "rog"        0.15
        "ques"       0.14
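    Feature dashboards of this kind typically produce these two lists by projecting the feature's decoder direction through the model's unembedding matrix and reading off the most negative and most positive entries. The page does not say how its values were computed, so the sketch below assumes that convention. The names `logit_effects`, `top_and_bottom`, and the toy vectors are hypothetical illustrations, not data from this feature.

```python
from typing import List, Sequence, Tuple


def logit_effects(decoder_dir: Sequence[float],
                  unembed: Sequence[Sequence[float]]) -> List[float]:
    """Dot the feature's decoder direction with every vocabulary column.

    Assumption: unembed is laid out as [d_model][vocab_size], like W_U.
    """
    vocab_size = len(unembed[0])
    return [
        sum(decoder_dir[d] * unembed[d][t] for d in range(len(decoder_dir)))
        for t in range(vocab_size)
    ]


def top_and_bottom(effects: Sequence[float], vocab: Sequence[str], k: int
                   ) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return the k most negative and k most positive (token, value) pairs."""
    order = sorted(range(len(effects)), key=lambda t: effects[t])
    negative = [(vocab[t], effects[t]) for t in order[:k]]
    # Largest values first, mirroring the "Positive Logits" list order.
    positive = [(vocab[t], effects[t]) for t in reversed(order[-k:])]
    return negative, positive


# Toy example: d_model = 2, vocabulary of 4 hypothetical tokens.
effects = logit_effects([1.0, 1.0], [[2.0, -1.0, 0.0, 1.0],
                                     [1.0, -2.0, -1.0, 0.5]])
neg, pos = top_and_bottom(effects, ["tok_a", "tok_b", "tok_c", "tok_d"], 2)
print(neg)  # [('tok_b', -3.0), ('tok_c', -1.0)]
print(pos)  # [('tok_a', 3.0), ('tok_d', 1.5)]
```

    Under this reading, a token appears in a list because the feature's direction pushes its logit down or up directly. That says nothing about the contexts in which the feature fires.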
    Activation density: 0.022%
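    A density of 0.022% corresponds to about 22 active tokens per 100,000. The sketch below assumes density means the percentage of sampled tokens on which the feature's activation exceeds zero. The function name `activation_density` and its `threshold` parameter are hypothetical, and the dashboard may use a different sample or cutoff.

```python
from typing import Iterable


def activation_density(activations: Iterable[float],
                       threshold: float = 0.0) -> float:
    """Percentage of tokens whose activation is strictly above threshold."""
    total = 0
    active = 0
    for a in activations:
        total += 1
        if a > threshold:
            active += 1
    if total == 0:
        raise ValueError("need at least one activation")
    return 100.0 * active / total


# Hypothetical sample: 22 firing tokens out of 100,000.
sample = [0.0] * 99_978 + [1.5] * 22
print(activation_density(sample))
```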

    No Known Activations