August 2, 2011 | SAN DIEGO—A new supercomputer named Gordon at the San Diego Supercomputer Center (SDSC) at the University of California, San Diego (UCSD) features a quarter of a petabyte of flash memory (hence the name) and has been dubbed “the world’s largest thumb drive.” It earned raves from Larry Smarr in a wide-ranging talk about the future management of life sciences data.
Smarr has spent ten years building the California Institute for Telecommunications and Information Technology (Calit2), a joint program of UCSD and UC Irvine. Speaking at CHI’s XGen Congress, Smarr urged organizations and universities to radically retool their computing infrastructure to support collaboration, data sharing, telecommunication, and the flood of next-generation sequencing (NGS) data.
Although he has long advocated the transformative potential of optical networking, he says, “the ability to have your own personal 10,000 megabit/second (Mbps) optical link is what we really need to deal with NGS machines. We’re trying to do data-intensive science on an infrastructure—the shared Internet—that was never meant for that.” But moving from the shared Internet to dedicated high-performance optical networks requires much else to change as well. “The last 100 feet aren’t there,” he said.
Many of the innovations Smarr and colleagues have deployed across the UCSD campus, built on optical fiber, offer a glimpse of the university campus of the future. High-definition video can be streamed live from microscopes to tiled LCD walls (driven by PCs with NVIDIA graphics cards), displaying microscopy collages of 600 million pixels.
Smarr says each optical fiber carries independent infrared channels, each providing 100–1,000 times the data throughput of the existing Internet. And instead of 200 university campuses sharing one channel, “I’m saying you should have one yourself.” For example, the National LambdaRail (with many 10GbE paths on its fibers) connects large data research centers in California and around the world.
Making the Switch
With Calit2 and the SDSC on its campus, it is no surprise that UCSD has jumped ahead in improving its campus cyberinfrastructure. UCSD now boasts 60 10GbE paths across campus, running in parallel with the shared Internet and eliminating data bottlenecks.
Users can choose which network carries their data using a simple three-level switch (0.5 terabits/sec). “Think about the clusters on campus and the space and energy they use,” says Smarr incredulously. “They’re completely isolated into islands, connected to the Internet at 10 megabits/sec. They’re 1,000:1 isolated from the rest of the world. You’re putting all your money into those instead of a fairly inexpensive optical switch? Whatever.”
UCSD has also brought optical fiber to the NGS facilities on the medical campus. “There’s nothing wrong with the shared Internet for email; it’s what it’s built for. But it’s not useful for where we’re going.” Trey Ideker, who heads the systems biology group, is starting to generate more NGS data (see “Groundbreaking Work”).
The UCSD campus has centralized data storage at SDSC, which Smarr likened to the old library at the center of campus. “Imagine a digital aquifer under the grass. All researchers get to use that data oasis. Then you plug in these 10 Gbps optical fibers.” Or imagine taking the output of an Illumina NGS instrument and putting it straight into RAM.
Smarr is no stranger to genomics data, having collaborated for years with Craig Venter on CAMERA, a global community resource for microbial metagenomics research (see “CAMERA Database Snaps into Action,” Bio•IT World, Apr 2007). CAMERA’s IT infrastructure boasts 512 processors, 5 teraflops, and 200 terabytes of storage.
“You can take your genome and BLAST it against the entire dataset. We now have more than 4,000 users in 90 countries, all connected to Calit2’s CAMERA cluster.” If a researcher has a dedicated 10Gbps connection to Calit2, they can use uncompressed, high-definition video feeds at 1,500 megabits/second. “This avoids latency—the enemy of real-time collaboration. This is the kind of thing you can do once you have this infrastructure in place.”
“The cost of electricity is becoming unbearable,” said Smarr. UCSD already draws 40 MW, and computing is becoming the biggest driver of rising electricity demand. Smarr is part of an NSF grant, the GreenLight Project, which is adapting Sun modular data centers to measure metrics including temperature and airflow for applications running on various architectures, from multicores to GPUs, FPGAs, routers, and storage. “At the end of the day, we have to know it costs this much in electricity or CO2 production. You’ll see this more and more. Universities have got to get on top of electricity costs.”
Smarr calls the SDSC’s new 245-TeraFlop supercomputer, Gordon, “the first high-performance data computer in the academic world. It has 256,000 GB [a quarter of a petabyte] of flash memory, that’s more flash in one place than anywhere in the world. We thank Steve Jobs for making flash memory cheap enough!” Smarr’s colleague Michael Norman, SDSC director, says Gordon “will do for scientific data analysis what Google does for Web search.”
In a normal computer, with tens of gigabytes of RAM, most data sits on disk. “But disk is 100X slower than memory. You’re disk I/O limited, waiting for the disk to get data to the RAM. Now imagine you have terabytes of RAM. You can put all your data in there at once. Then algorithms completely change.”
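A back-of-envelope calculation shows why that 100X gap matters at data-intensive scale. The throughput figures below are illustrative assumptions of roughly 2011-era hardware (about 100 MB/s for sequential disk reads, about 10 GB/s for memory), not numbers from the talk:

```python
# Back-of-envelope: time to scan 1 TB of data from spinning disk vs. RAM.
# Throughput figures are rough, illustrative assumptions, not measurements.
DATA_BYTES = 1e12        # 1 TB working set
DISK_BPS = 100e6         # ~100 MB/s sequential read from a spinning disk
RAM_BPS = 10e9           # ~10 GB/s memory bandwidth, i.e. 100x faster

disk_seconds = DATA_BYTES / DISK_BPS   # ~10,000 s, nearly 3 hours
ram_seconds = DATA_BYTES / RAM_BPS     # ~100 s, under 2 minutes

print(f"disk scan: {disk_seconds / 3600:.1f} h, RAM scan: {ram_seconds:.0f} s")
```

Under these assumptions a single pass over a terabyte drops from hours to minutes, which is why Smarr says algorithms "completely change" once the data fits in memory.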
Gordon has 32 nodes, each with 2 TB of RAM and 8 TB of flash SSD (solid-state drive), plus a 4-PB parallel disk farm (file system). “There’s nothing like it in the world,” says Smarr. “When I think about next-gen sequencing, Gordon is the machine almost built for this. De novo assembly will benefit from large shared memory. This is not your father’s supercomputer. It’s a high-performance computer designed for data-intensive science, just as supercomputers were optimized for solving differential equations. Federations of databases and interaction networks will benefit from low-latency I/O from flash.”
The construction of Gordon, funded by a $20-million NSF grant, has also benefited from the plummeting price of 10GbE switches. In 2005, Smarr said, a 10GbE port cost around $80,000. In 2011, a single port costs less than $1,000. Gordon will have 128 parallel channels, each 10GbE. “We now use 10GbE paths in the back-end like they’re popcorn! 10G is the new 1G. Apple is shipping MacBooks with two 10-Gbps ports! People still act like 10GbE is a lot—I don’t get it.”
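The economics behind "10G is the new 1G" can be made concrete with the article's own figures, the port-price drop and the aggregate bandwidth of Gordon's 128-channel back-end:

```python
# Per-port price drop and aggregate back-end bandwidth, using the
# round numbers quoted in the article.
price_2005 = 80_000   # USD per 10GbE port in 2005
price_2011 = 1_000    # USD per 10GbE port in 2011 ("less than $1,000")
drop_factor = price_2005 / price_2011         # 80x cheaper in six years

channels = 128
gbps_per_channel = 10
aggregate_tbps = channels * gbps_per_channel / 1000   # 1.28 Tb/s

print(f"ports are {drop_factor:.0f}x cheaper; back-end = {aggregate_tbps} Tb/s")
```

At 2005 prices, 128 ports alone would have cost over $10 million, half the machine's entire NSF grant; by 2011 they cost roughly $128,000.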
Smarr had less positive views on cloud computing, however. “The Cloud is not set up for terabyte or gigabyte files,” he said. “You can get there, but once you’re inside, there isn’t the SDSC 10GbE farm to move your data around. How much to get it back out? What do you pay for egress and exit? There are lots of developments necessary for commercial clouds to be useful for science.”
“You need to understand you have a problem,” Smarr continued. “I have data! It’s exponentially growing. It just boggles my mind how otherwise intelligent places aren’t dealing with it. It’s hard—you have to bring together experts who normally don’t talk to each other—biologists, computer scientists, engineers. You have to bring together... the School of Medicine, the campus, networking/storage, departments. Now it’s a collective problem. No-one has enough money themselves.”
“People don’t think about exponentials, but they make the impossible routine as we go through the threshold you care about. It’s impossible to plan for. It cost a couple of billion dollars for the first human genome. Now it’s $1,000?! That’s a factor of 1 million in ten years. Over that time, Moore’s Law is 1,000. It’s the square of Moore’s Law.”
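Smarr's "square of Moore's Law" arithmetic can be checked with his round numbers. Note these are order-of-magnitude figures, and his "Moore's Law is 1,000" implies roughly one doubling per year over the decade, an aggressive but common shorthand:

```python
# Smarr's round-number arithmetic: sequencing cost fell ~10^6-fold in a
# decade, while Moore's Law (at ~one doubling per year) gives ~10^3.
years = 10
moore_factor = 2 ** years            # 1,024, i.e. roughly 1,000

first_genome_cost = 3e9              # "a couple of billion dollars"
current_cost = 1e3                   # "$1,000"
cost_factor = first_genome_cost / current_cost   # ~3 million, order 10^6

print(f"Moore: ~{moore_factor}; cost drop: ~{cost_factor:.0e}; "
      f"Moore squared: ~{moore_factor ** 2}")
```

To order of magnitude, 10^6 is indeed (10^3)^2: sequencing costs fell as fast as Moore's Law squared.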
“The 10GbE data superhighway is coming into being. NGS is its most important application for science, because of the democratization of sequencing.” •