Explanations
mentions or references to the word "human."
references to human characteristics and experiences
Negative Logits
arella       -0.82
è¦           -0.81
armac        -0.76
OHN          -0.68
urations     -0.68
liga         -0.68
é¾įå         -0.67
abb          -0.67
effective    -0.66
kick         -0.65
Positive Logits
beings       1.34
itar         1.08
oids         1.07
itarian      1.02
readable     0.99
istic        0.96
embryonic    0.93
oid          0.92
ized         0.90
izing        0.88
Activations Density 0.026%