INDEX
    Explanations

    references to human beings and their experiences

    Negative Logits
    Token          Logit
    " Human"       -0.22
    " human"       -0.20
    "_human"       -0.19
    " Humanity"    -0.19
    "umper"        -0.19
    "Human"        -0.18
    "人类"          -0.17
    "appen"        -0.17
    "gers"         -0.17
    "eson"         -0.16

    Positive Logits
    Token          Logit
    " beings"      0.43
    "oids"         0.29
    "ely"          0.29
    "itarian"      0.28
    "eness"        0.27
    "-machine"     0.24
    "istic"        0.24
    "-readable"    0.24
    "OID"          0.23
    "itar"         0.22

    Activation Density: 0.051%

    No Known Activations