INDEX
    Explanations

    linguistic expressions related to rules, regulations, and standards

    concepts related to compliance and social norms

    Negative Logits
    Token      Logit
    "`."       -0.66
    "iven"     -0.62
    "Written"  -0.61
    "Fra"      -0.58
    " ];"      -0.56
    " ],"      -0.56
    "åĪ"       -0.55
    "ãĤ»"      -0.55
    "ando"     -0.55
    "Dim"      -0.54
    Positive Logits
    Token        Logit
    " deserve"   1.14
    " are"       1.05
    " tended"    1.02
    " aren"      1.00
    " tend"      0.99
    " have"      0.99
    " cannot"    0.98
    " must"      0.94
    " may"       0.93
    " shouldn"   0.93
    Act Density 0.470%

    No Known Activations