INDEX
Explanations
references to humans and human-related concepts
Negative Logits
gor          -0.18
thon         -0.17
ses          -0.16
holm         -0.15
lass         -0.15
Horton       -0.14
gae          -0.14
ern          -0.14
upertino     -0.14
Usa          -0.14
Positive Logits
beings        0.33
ely           0.26
itarian       0.25
-readable     0.25
oids          0.24
-machine      0.22
made          0.21
ized          0.20
-human        0.20
itar          0.19
Activations Density: 0.044%