Explanations
references to various aspects of humanity and human experience
Negative Logits
Token      Logit
umper      -0.19
appen      -0.18
Humanity   -0.18
Human      -0.18
_human     -0.17
eson       -0.17
gers       -0.17
Horton     -0.16
gable      -0.16
human      -0.15
Positive Logits
Token      Logit
beings     0.41
ely        0.34
oids       0.33
istic      0.33
itarian    0.31
izing      0.26
ly         0.26
-machine   0.26
ized       0.26
made       0.26
Activations Density: 0.046%