INDEX
    Explanations

    words related to manners and behavior towards others

    Negative Logits
    Token               Logit
    " GOODMAN"          -0.86
    " disproportion"    -0.73
    "arnaev"            -0.73
    "senal"             -0.69
    "VL"                -0.68
    "ilion"             -0.68
    " Layer"            -0.66
    "Offline"           -0.66
    "idem"              -0.64
    "LR"                -0.64

    Positive Logits
    Token               Logit
    " embraced"         0.86
    " entertained"      0.84
    " parted"           0.84
    " greeted"          0.83
    " awaiting"         0.83
    " inquired"         0.82
    " awaited"          0.82
    " welcomed"         0.81
    " complied"         0.81
    " accepted"         0.80
    Activation Density: 0.073%

    No Known Activations