Going Green for 1000 Genomes




Baylor data center deploys Rackable systems for power savings and performance.

By Kevin Davies

Sept. 5, 2008 | As the NIH 1000 Genomes project kicks into top gear, data center managers are racing to come to terms with the glut of data that they will have to manage from the next-generation sequencing instruments. “This is a problem for all the [genome] centers to deal with, as well as the repositories such as NCBI,” says David Parker, Baylor College of Medicine Human Genome Sequencing Center systems analyst.

Parker says the 1000 Genomes project not only represents a dramatic difference in the scale of the data that are being generated and collected, but “it also changes the compute characteristic.” Parker says that the old Sanger sequencing instruments (“We loved those!”) had a relatively low requirement for processing power. “We could typically have small, slow processors. We never bought the fastest processor because we didn’t need that much CPU time. We could run on single-core machines, without much RAM.”

The image analysis requirements of the new sequencing platforms change all that. The Baylor genome center chiefly uses 454 machines (as it did in completing the sequence of James Watson last year; see “Project Jim,” Bio•IT World, June 2007), as well as Illumina and, most recently, Applied Biosystems SOLiD machines.

Now, says Parker, “We’re buying the fastest multi-core processors we can get with all the RAM we can jam in them.” Network performance is a much bigger issue now, as Parker tries to squeeze every ounce of performance from the system.

In preparing for the 1000 Genomes project, Parker says, “We’ve really rethought our entire architecture and basically started from scratch.” In a way, the timing was perfect: having run out of physical space, Parker’s team was expanding the center in a remote location. For once, space is not a problem.

It’s not much more than a back-of-the-envelope calculation, but Parker estimates that the center’s storage requirements are set to expand by 150 terabytes (TB) a quarter. “That’s just for the original data, not workspace. Workspace is typically 50-100% of that. So roughly 250 TB/quarter,” says Parker.
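Parker's arithmetic can be checked with a few lines of Python. This is only a sketch of the estimate as quoted: the function name and the treatment of workspace as a simple fraction of the original data are assumptions made for illustration, not the center's actual planning model.

```python
def quarterly_storage_tb(raw_tb, workspace_fraction):
    """Total new storage per quarter: original data plus workspace,
    where workspace is sized as a fraction of the original data.
    (Hypothetical helper; the article gives only the rough figures.)"""
    return raw_tb * (1 + workspace_fraction)


RAW_TB_PER_QUARTER = 150  # Parker's estimate for original data alone

# Workspace is quoted as "typically 50-100%" of the original data.
low = quarterly_storage_tb(RAW_TB_PER_QUARTER, 0.5)
high = quarterly_storage_tb(RAW_TB_PER_QUARTER, 1.0)

print(f"{low:.0f}-{high:.0f} TB per quarter")
```

Under those assumptions the range works out to 225-300 TB a quarter, which is consistent with Parker's "roughly 250."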

But there is some good news on the primary data front. Researchers are actually disposing of it. “We’re making progress on it; it’s always a fun subject!” says Parker. “I was buying beer for everyone in the center last week because we actually deleted 15 TB of old data!”

Rack ’Em
On the storage side, Parker says the platforms have to be more robust, scalable, and able to meet “a constant demand by the researchers for more storage.” When it comes to storage, Parker isn’t terribly picky and says he’s considered just about everything. “We’ve looked at Sun… We’ve done pilots with Isilon, IBRIX, Panasas, BlueArc, NetApp, RapidScale (the old TeraScale stuff that Rackable bought)… There’s no such thing as storage you can’t use.”

He still uses Hitachi SAN-based storage with its virtualization capabilities. “But it’s expensive. I don’t know that’s going to be financially wise as we get into the multi-petabyte range. So we’re looking at alternatives, like Lustre and ClusterFS that can do the same for a lot less money.”

One vendor that Parker is sold on is Rackable Systems, a Fremont, Calif., server and storage manufacturer that prides itself on its “ecological” equipment. Parker first started using Rackable servers about five years ago, after contacting “every cluster vendor on the planet,” including IBM, HP, and Western Scientific, to assess the costs, benefits, and alternatives. “They have the kind of density for floor space I can get from blade servers, yet there is no cost premium for that density. Blade servers still cost more money than 1-U servers. And you don’t have the flexibility. Blade servers are certainly wonderful things, but I can buy the same number of processors in the same footprint with Rackable, pay a lot less money and use a lot less electricity and generate a lot less heat.”

The reason for Rackable’s “green” reputation is that its servers run on DC power, supplied by rectifiers at the top of the cabinet. That improves efficiency and saves a lot of trees. “We estimate it saves 30% on the power and on the cooling,” says Parker.

Parker says he’s not sacrificing anything by giving up a blade server. “If you buy a blade server the motherboards are proprietary,” he says. For example, if he buys from HP, he can only buy HP replacements. Rackable, on the other hand, uses off-the-shelf parts. “If I don’t like the motherboards they ship me, I can buy them myself.”

Aside from flexibility, lower cost, and a good number of processors per rack, Parker says, “I also like them because they just do excellent work. It rolls in; it’s wired the way I want it; it has the network switches I want. It’s just beautiful!”

Before joining Baylor, Parker had a consulting company with his brothers. But in 2001, he laughs, “Our customers went out of business!” He calls the past five years “a great adventure.” It sounds like the fun is just beginning. 

___________________________________________________

This article appeared in Bio-IT World Magazine.
