    Explanations

    phrases related to cultural or societal norms

    references to social norms and their variations

    Negative Logits
    Token           Logit
    " Lama"         -0.68
    "semble"        -0.66
    " Kush"         -0.62
    " Sunder"       -0.59
    " Newport"      -0.58
    "wrapper"       -0.58
    " istg"         -0.58
    " Tub"          -0.57
    " Lizard"       -0.57
    " Riverside"    -0.57
    Positive Logits
    Token           Logit
    "ativity"       1.31
    "ality"         1.18
    " ante"         0.97
    "atively"       0.92
    "als"           0.85
    " quo"          0.80
    "itionally"     0.79
    "eers"          0.79
    " prev"         0.76
    "essential"     0.75
    Activation Density: 0.035%

    No Known Activations