INDEX
    Explanations

    references to behavioral change and social conduct

    Negative Logits
    Token         Logit
    "599"         -0.16
    "elow"        -0.14
    "inding"      -0.13
    " insult"     -0.13
    "ά"           -0.13
    "maj"         -0.13
    "ibil"        -0.13
    "ेदन"         -0.13
    "ाहत"         -0.13
    "üç"          -0.13
    Positive Logits
    Token          Logit
    " behavior"    0.88
    " behaviour"   0.81
    " Behavior"    0.73
    " behaviors"   0.73
    "behavior"     0.68
    "行为"         0.67
    " behaviours"  0.64
    " conduct"     0.63
    " Behaviour"   0.63
    "Behavior"     0.60
    Activation Density: 0.451%

    No Known Activations