INDEX
    Explanations

    references to manipulative or deceptive behavior in social contexts

    Negative Logits
    ót
    -0.15
    vit
    -0.15
    ví
    -0.15
     smarty
    -0.14
     bias
    -0.14
    KHTML
    -0.14
     ду
    -0.14
    reeNode
    -0.13
    indow
    -0.13
    probe
    -0.13
    Positive Logits
     Pickup
    0.29
     pickup
    0.28
     incel
    0.23
    pickup
    0.23
     pickups
    0.22
    PU
    0.21
     PU
    0.20
     kino
    0.20
     pick
    0.19
    pick
    0.19
    Activation Density 0.067%

    No Known Activations