INDEX
    Explanations

    phrases related to rule-breaking or bending social norms

    negative phrases or statements

    Negative Logits
     caution      -0.67
     flirt        -0.67
     Dickinson    -0.67
     Arabian      -0.65
     hiber        -0.64
     pomp         -0.62
     shares       -0.61
     curtain      -0.61
     troop        -0.60
     adm          -0.60
    Positive Logits
    turned        1.29
    cum           1.17
    sama          1.14
    selves        1.13
    sized         1.12
    style         1.12
    related       1.11
    type          1.06
    induced       1.05
    san           1.04
    Activation Density: 0.164%

    No Known Activations