    Explanations

    references to humans and human-related concepts

    Negative Logits
    Token       Logit
    gor         -0.18
    thon        -0.17
    ses         -0.16
    holm        -0.15
    lass        -0.15
     Horton     -0.14
    gae         -0.14
    ern         -0.14
    upertino    -0.14
     Usa        -0.14
    Positive Logits
    Token       Logit
     beings     0.33
    ely         0.26
    itarian     0.25
    -readable   0.25
    oids        0.24
    -machine    0.22
    made        0.21
    ized        0.20
    -human      0.20
    itar        0.19
    Activation Density: 0.044%

    No Known Activations