INDEX
Explanations
references to human beings and their experiences
Negative Logits
Token       Logit
Human       -0.22
human       -0.20
_human      -0.19
Humanity    -0.19
umper       -0.19
Human       -0.18
人类        -0.17
appen       -0.17
gers        -0.17
eson        -0.16
Positive Logits
Token       Logit
beings      0.43
oids        0.29
ely         0.29
itarian     0.28
eness       0.27
-machine    0.24
istic       0.24
-readable   0.24
OID         0.23
itar        0.22
Activations Density: 0.051%
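
The page does not define the density figure. A minimal sketch, assuming it means the fraction of sampled token positions on which the feature's activation is above zero; the function name `activation_density`, the `threshold` parameter, and the sample values are hypothetical and not taken from the page:

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds threshold.

    Assumption: "density" is the share of token positions where the
    feature fires at all; the source page does not state its definition.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Made-up activations: 5 active positions out of 10,000 sampled tokens.
sample = [0.0] * 9995 + [1.2, 0.4, 3.1, 0.9, 2.2]
density = activation_density(sample)
print(f"{density:.3%}")
```

Under that reading, a density of 0.051% would correspond to the feature firing on roughly 5 of every 10,000 tokens.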