    Explanations

    words associated with conscience and moral responsibility

    Negative Logits
    " kon"      -0.17
    "pillar"    -0.16
    "Ìĥ"        -0.16
    "utto"      -0.15
    "rea"       -0.15
    "à¹Īว"     -0.14
    "ignet"     -0.14
    "ilar"      -0.14
    "aping"     -0.14
    "iled"      -0.14
    Positive Logits
    "front"       0.28
    " front"      0.20
    "dem"         0.19
    "-front"      0.19
    "desc"        0.19
    " sequ"       0.18
    "Front"       0.17
    "descending"  0.17
    "science"     0.17
    "-dem"        0.17
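    Logit lists like these are commonly produced by projecting a feature's decoder direction through the model's unembedding matrix and keeping the most negative and most positive entries. The sketch below shows that step under stated assumptions. The decoder vector `w_dec`, the unembedding matrix `W_U` (stored as d_model rows by vocab_size columns), and the helper names are hypothetical. This page does not say how its own values were computed.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, float]


def logit_effects(w_dec: List[float], W_U: List[List[float]], vocab: List[str]) -> Dict[str, float]:
    """Dot the (hypothetical) decoder direction with each unembedding column.

    w_dec: feature decoder vector, length d_model.
    W_U:   unembedding matrix as d_model rows of vocab_size columns.
    Returns token -> direct logit contribution of the feature direction.
    """
    d_model = len(w_dec)
    effects: Dict[str, float] = {}
    for j, token in enumerate(vocab):
        effects[token] = sum(w_dec[i] * W_U[i][j] for i in range(d_model))
    return effects


def top_logits(effects: Dict[str, float], k: int) -> Tuple[List[Pair], List[Pair]]:
    """Return the k most negative and the k most positive (token, logit) pairs."""
    ranked = sorted(effects.items(), key=lambda kv: kv[1])
    negative = ranked[:k]        # smallest values first
    positive = ranked[::-1][:k]  # largest values first
    return negative, positive


if __name__ == "__main__":
    # Toy data only; these numbers are invented for illustration.
    vocab = [" kon", "pillar", "front", " front"]
    w_dec = [1.0, 0.5]
    W_U = [
        [-0.2, -0.1, 0.3, 0.1],
        [0.0, 0.0, 0.0, 0.2],
    ]
    neg, pos = top_logits(logit_effects(w_dec, W_U, vocab), k=2)
    print("negative:", neg)
    print("positive:", pos)
```

    Tokens are kept as raw strings, so a leading space (" front" versus "front") marks a different vocabulary entry, which is why both appear separately in the lists.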
    Activation Density: 0.025%
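    Activation density is usually reported as the percentage of sampled tokens on which the feature fires. A density of 0.025% corresponds to roughly one token in 4,000. The following is a minimal sketch. It assumes activations arrive as a flat list of floats and that "fires" means strictly above a threshold of 0.0. The helper name and the default threshold are assumptions, not this dashboard's definition.

```python
from typing import Iterable


def activation_density(activations: Iterable[float], threshold: float = 0.0) -> float:
    """Percentage of tokens whose activation exceeds `threshold` (assumed cutoff)."""
    total = 0
    active = 0
    for a in activations:
        total += 1
        if a > threshold:
            active += 1
    if total == 0:
        raise ValueError("no activations supplied")
    return 100.0 * active / total


if __name__ == "__main__":
    # One firing token out of 4,000 gives the 0.025% figure shown above.
    sample = [1.0] + [0.0] * 3999
    print(activation_density(sample))
```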

    No Known Activations