Explanations
mentions of the word "human" in the text
references to human characteristics and attributes
Negative Logits
ãĥ´ãĤ¡  -0.82
terness  -0.75
RAG  -0.71
é¾įå  -0.70
forth  -0.69
arity  -0.69
Franch  -0.68
raise  -0.67
åĭ  -0.65
bg  -0.65
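Several entries in the negative list, such as é¾įå and åĭ, are not corrupted text but token strings written in a byte-level BPE alphabet. GPT-2-style tokenizers map every raw byte to a printable Unicode character, so multi-byte UTF-8 characters, and fragments of them, show up as Latin-looking sequences. The sketch below assumes the model uses GPT-2's byte-to-unicode table (the dashboard does not name the model or tokenizer). The helper names `bytes_to_unicode`, `BYTE_DECODER`, and `decode_token` are illustrative.

```python
def bytes_to_unicode() -> dict[int, str]:
    """Byte -> printable character table used by GPT-2-style byte-level BPE."""
    # Bytes that are already printable keep their own code point.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Remaining bytes (control chars, space, 0x7F-0xA0, 0xAD) are shifted to 256 + n.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}


# Inverse table: printable character -> original byte.
BYTE_DECODER = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(token: str) -> str:
    """Map a byte-level token string back to raw bytes, then decode as UTF-8.

    Tokens that stop partway through a multi-byte character cannot be decoded
    fully; the dangling bytes become U+FFFD replacement characters.
    """
    raw = bytes(BYTE_DECODER[ch] for ch in token)
    return raw.decode("utf-8", errors="replace")


for tok in ["ãĥ´ãĤ¡", "é¾įå", "åĭ", "terness"]:
    print(repr(tok), "->", repr(decode_token(tok)))
```

Under this assumption, ãĥ´ãĤ¡ decodes to the katakana ヴァ, é¾įå decodes to 龍 followed by the first byte of another character, and åĭ holds only the first two bytes of a three-byte character, so it has no complete rendering on its own.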
Positive Logits
oids  1.22
beings  1.17
readable  0.98
oid  0.92
zee  0.88
made  0.87
embryonic  0.79
traffickers  0.76
colonists  0.76
genome  0.75
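Dashboards of this kind commonly build the positive and negative logit lists by projecting the feature's decoder direction onto the model's unembedding matrix and ranking every vocabulary token by the result. The sketch below assumes that procedure. The source does not state it, and real pipelines may also fold in the final layer norm. The vectors are toy values, not this feature's weights, and `logit_effects` and `top_and_bottom` are hypothetical helpers.

```python
def logit_effects(decoder_dir: list[float], unembed: dict[str, list[float]]) -> dict[str, float]:
    """Direct effect of a feature direction on each token's logit (a dot product)."""
    return {
        tok: sum(d * w for d, w in zip(decoder_dir, col))
        for tok, col in unembed.items()
    }


def top_and_bottom(effects: dict[str, float], k: int):
    """Return the k most positive and the k most negative (token, effect) pairs."""
    ranked = sorted(effects.items(), key=lambda kv: kv[1])
    positive = ranked[::-1][:k]  # largest effect first
    negative = ranked[:k]        # most negative effect first
    return positive, negative


# Toy 3-dimensional residual stream and a four-token vocabulary (illustrative only).
decoder_dir = [1.0, 0.0, -0.5]
unembed = {
    "oids": [1.2, 0.0, 0.0],
    "beings": [1.0, 0.2, 0.0],
    "raise": [0.0, 1.0, 0.0],
    "forth": [-0.4, 0.0, 0.6],
}
positive, negative = top_and_bottom(logit_effects(decoder_dir, unembed), k=2)
print("positive:", positive)
print("negative:", negative)
```

Because the ranking runs over the whole vocabulary, subword pieces such as oids and oid can sit next to whole words in the same list.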
Activation Density: 0.036%
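Activation density is usually read as the share of tokens in a sample on which the feature's activation is positive. On that reading, 0.036% corresponds to roughly 36 activating tokens per 100,000. The sketch below assumes that definition with a strict greater-than-zero threshold. The dataset and sample size behind this dashboard's figure are not given, so the activations here are synthetic and `activation_density` is a hypothetical helper.

```python
def activation_density(activations: list[float], threshold: float = 0.0) -> float:
    """Percentage of tokens whose feature activation exceeds `threshold`."""
    if not activations:
        raise ValueError("need at least one activation")
    fired = sum(1 for a in activations if a > threshold)
    return 100.0 * fired / len(activations)


# Synthetic sample: 36 firing tokens out of 100,000.
acts = [0.0] * 99_964 + [1.0] * 36
print(f"{activation_density(acts):.3f}%")
```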