Dealing With Fast-Growing Data With Hyperscale Data Distribution

August 16, 2018

Contributed Commentary by Chin Fang

The biopharmaceutical industry faces fast-growing data movement challenges. The latest genome sequencers can produce multiple TBs of data in a single run; modern medical imaging systems generate files of ever-growing resolution, and thus ever-larger size; in an active lab, personalized medicine research data can grow by multiple TBs per day, to name just a few examples. Large organizations in the industry are often distributed, even global, in nature, and multi-site collaboration is essential to research progress. These factors, along with other needs such as disaster recovery, regulatory compliance, and digital asset protection, make the problem even more acute.

Until about two years ago, the existing data transfer solutions, both commercial and free, could get the job done, if not always satisfactorily. But the industry is changing fast, and a new way of thinking and working has become necessary. It is time to introduce the concept of "hyperscale data distribution." "Hyperscale data" means data sized at 1TB and above, to be transported at a rate of 20Gbps or more, over any distance. The word "distribution" reflects the industry's collaboration practices, which often involve parties at multiple sites (possibly on-premises and in clouds as well), rather than just a single pair of endpoints.
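To put those figures in perspective, a rough back-of-the-envelope calculation shows what 1TB at 20Gbps means in wall-clock time. The 80% end-to-end efficiency used below is an assumed figure for illustration, not a measured one:

```python
def transfer_time_seconds(size_tb, rate_gbps, efficiency=0.8):
    """Back-of-the-envelope wall-clock time for a bulk transfer.

    size_tb:    payload size in decimal terabytes
    rate_gbps:  nominal transfer rate in gigabits per second
    efficiency: assumed fraction of the nominal rate actually achieved
    """
    bits = size_tb * 1e12 * 8                     # decimal TB -> bits
    return bits / (rate_gbps * 1e9 * efficiency)

# 1 TB at 20 Gbps with an assumed 80% end-to-end efficiency: about 8.3 minutes.
print(f"{transfer_time_seconds(1, 20) / 60:.1f} minutes")
```

At the lab-scale growth rates mentioned above, multiple TBs per day, even that optimistic figure leaves little slack for retries or idle links.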

Hyperscale data distribution, however, is not a run-of-the-mill activity. Many IT practitioners in the biopharmaceutical industry, perhaps because their personal experience is limited to moving end-user data, underestimate how formidable the challenge is. In my experience working intensively with my team since 2014 to create an effective solution for hyperscale data distribution, a solid understanding of the four IT stacks, and of how they work together, preempts many data transfer problems.

Before the data tsunami hits, make sure your firm's IT teams understand the interdependencies of the four stacks: storage, computing, networking (including network security; a firewall, for example, should not become a bottleneck), and the software stack (especially its concurrency aspects). The data movement setup must be optimized holistically, not stack by stack.
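A simple way to see why holistic tuning matters is to model the end-to-end rate as the minimum across the stacks. The sketch below is illustrative only; the per-stack figures are hypothetical, not measurements:

```python
# Illustrative only: the end-to-end transfer rate is bounded by the slowest stack.
# The per-stack rates below are hypothetical figures in Gbps.
stacks = {
    "source storage (read)":         12.0,
    "source host (CPU/memory)":      40.0,
    "network path (incl. firewall)": 100.0,
    "destination storage (write)":    9.0,
}

bottleneck = min(stacks, key=stacks.get)
print(f"Achievable rate is roughly {stacks[bottleneck]:.0f} Gbps, "
      f"limited by the {bottleneck}")
```

In this hypothetical case, buying a faster network link changes nothing; the destination storage write path is what needs attention.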

Hyperscale data distribution is not a simple task for which you can find recipes or cookbooks. Sites such as the U.S. DOE ESnet Fasterdata knowledge base may make it tempting to follow published "tips" blindly; don't. Use them as starting points for your own investigation. Most likely, you will need some form of machine learning and discrete optimization to tune the entire setup from time to time. Note that every element of a transfer setup comes in discrete quantities: the number of servers, the CPUs per server and cores per CPU, the amount of memory, the number and speed of network interfaces and ports, the storage service's queue depth, and the number of threads dedicated to a processing task. Furthermore, different companies have different environments, so it is unlikely that any expert can quickly grasp all the differences and "make it right" in every case. A computational approach is likely far more effective.
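As one concrete, if deliberately simple, illustration of such a computational approach, the sketch below runs a random search over a few discrete tuning knobs. The knob names, candidate values, and the benchmark hook are all assumptions for illustration; a real deployment would plug in its own parameters and measurement harness:

```python
import random

# Hypothetical discrete tuning knobs and candidate values; adjust to your environment.
SEARCH_SPACE = {
    "parallel_streams": [1, 2, 4, 8, 16, 32],
    "io_threads":       [2, 4, 8, 16],
    "block_size_mib":   [1, 4, 16, 64],
    "queue_depth":      [8, 16, 32, 64, 128],
}

def benchmark(config):
    """Placeholder: run a short timed transfer with `config` and return Gbps achieved."""
    raise NotImplementedError("wire this up to your own transfer test harness")

def random_search(trials=25, seed=0):
    """Try random combinations of discrete knob values and keep the best one."""
    rng = random.Random(seed)
    best_config, best_rate = None, 0.0
    for _ in range(trials):
        config = {knob: rng.choice(values) for knob, values in SEARCH_SPACE.items()}
        rate = benchmark(config)
        if rate > best_rate:
            best_config, best_rate = config, rate
    return best_config, best_rate
```

Random search is only the crudest starting point; model-based methods such as Bayesian optimization cope better with expensive benchmarks and discrete parameters, but the structure of the problem is the same.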

Never treat such an endeavor as a networking-only task. Even today, in 2018, I still run into pleas published online by people in cancer research circles, asking the free-software community to produce a clone of a large IT vendor's proprietary UDP-based transport protocol. That approach will not be fruitful.

In fact, network transfer protocols (a.k.a. transport protocols) are only of secondary importance in hyperscale data distribution. Storage performance (possibly aggregated) is the most critical factor. This is easy to see with a water-supply analogy: if the reservoir is dry or its level is low, then no matter how wide the pipes are or how powerful the pumps, not enough water can be pushed through the system. In this analogy, data transfer software plays the role of the pumps.
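A quick sanity check on the storage side, before blaming the network, is to measure how fast the source can actually read the data to be sent. Below is a minimal Python sketch; treat it as a rough indicator only, since the operating system's page cache can inflate the numbers, and dedicated tools such as fio give far more faithful results. The file path in the usage comment is a placeholder:

```python
import time

def read_throughput_gbps(path, block_size=8 * 1024 * 1024):
    """Sequentially read a file and report the observed throughput in Gbps."""
    total_bytes = 0
    start = time.monotonic()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total_bytes += len(chunk)
    elapsed = time.monotonic() - start
    return total_bytes * 8 / elapsed / 1e9

# Example usage: point this at a file representative of your transfer workload.
# print(read_throughput_gbps("/data/sample_dataset.bam"))
```

If the source cannot read, or the destination cannot write, at anything close to the target rate, no transport protocol will save the transfer.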

A final piece of key advice: never assume that software alone can make the magic happen. It won't. You need all four stacks working correctly and tuned to match one another.

Chin Fang, Ph.D., is the founder and CEO of Zettar Inc., a Mountain View, California-based software startup in the HPC space. Zettar delivers a GA-grade hyperscale data distribution software solution. Since 2014, he has personally transferred more than 50PB of data over various distances. Chin can be reached at fangchin@zettar.com.