INDEX
    Explanations

    phrases relating to social norms and values

    Negative Logits
    OLT     -0.17
    ops     -0.15
    egra    -0.15
    rows    -0.15
    APH     -0.15
    asti    -0.15
    .rows   -0.14
    enda    -0.14
    orr     -0.14
     forks  -0.14
    Positive Logits
    ochen   0.17
    yük     0.15
    iggs    0.15
    adle    0.15
    ç¼      0.14
     Stuff  0.14
     etc    0.14
     quot   0.14
     nas    0.14
     Rails  0.14
    Activation Density: 0.318%

    No Known Activations