By Kevin Davies
November 10, 2009 | “Cloud computing and cloud storage are two sides of the same coin,” explains Sajai Krishnan, CEO of Silicon Valley’s venture-funded ParaScale. Cloud computing has had virtualization tools such as VMware and Xen to do most of the heavy lifting, eventually growing until it necessitated a cloud service. “The equivalent to VMware in the cloud storage space is ParaScale,” says Krishnan. Of course, storage competitors including EMC, HP, NetApp, and Nirvanix might dispute that.
ParaScale software “takes you out of the box model of storage, equivalent to the one app/one server or mainframe in a proprietary computing stack, and puts you in the space of harnessing an aggregate pool of commodity storage that is managed easily as if it were one single appliance, with storage capacity being used as needed by different sets of applications,” says Krishnan. “Whether it’s inside Lilly or inside Amazon, that’s just a business model.” ParaScale uses Linux servers, the cheapest enterprise servers. “The cheapest storage you can find is inside a Linux server,” says Krishnan. Aggregating the disks inside a hundred Linux servers into one massive pool of storage, users can “slice and dice” as they see fit.
While VMware saves a lot of money on the compute side, Krishnan argues it significantly increases complexity on the storage side, because users typically need a shared SAN (storage area network) and IT experts who understand not only virtualization but also SAN storage management. “Cloud storage not only addresses the complexity issue but it also addresses the exploding need around managing storage associated with the latest sequencers,” he says, which pump out more than 10 Terabytes/week. “It’s an ungodly amount of data! You have a set of scientists being turned into storage administrators. I know one Ph.D. in the U.K. who is single-handedly managing more than 1 Petabyte data. Somebody should give the guy a medal!”
For storage, users can benefit from having a private cloud, says Krishnan (see “Stanford Goes Private”). In many cases, the economics of owning it are superior to the economics of renting from Amazon. Whereas many cloud compute users want on-demand flexibility, cloud storage is a more steady-state phenomenon. Cloud computing suits users with dynamic usage patterns, people who might need 1,000 nodes to do some analysis over 3 weeks and then need none over the next 3 months. “It’s the perfect situation to go to an Amazon compute cloud,” says Krishnan.
Storage, on the other hand, doesn’t have the same dynamism and tends to steadily grow in capacity. “Without the requirement to handle huge bursts in storage capacity over short periods, the public storage cloud may not provide you that same sort of peak-demand benefit. But the private storage cloud will be the right answer for many companies since it gives you affordability, performance and in-datacenter security. Why should only Amazon have the technology leveraging commodity storage? Why shouldn’t Lilly? Why shouldn’t Home Depot etc. own their own private clouds? Clearly their IT teams have the expertise and scale to manage vast IT installations.”
NAS or SAN
Krishnan says the vendor community is “glomming onto cloud storage in a big way,” but confusing the issue. “There was until recently only one type of cloud storage,” he says. The traditional kind of cloud storage is NAS (network attached storage), or file storage, as exemplified by Amazon S3 and the Google cloud system. But increasingly, people are talking about SAN (block) storage, a term applied to the storage directly associated with cloud computing and supplied by companies such as 3PAR, EMC and NetApp.
SAN-type storage, similar to Amazon EC2, is not large in terms of Terabytes, maybe 100-200 TB. “That’s what my friend in the U.K. chomps through in a couple of months. So that’s not the cloud storage that’s going to solve the data-sequencing problem. That’s where you need the NAS-type cloud storage.”
“We’re just easy to manage, inexpensive file storage,” Krishnan continues. “Wherever someone has NAS, you end up where 90% data, while valuable, is not necessarily the kind that needs to sit on expensive tier 1 storage. Moving that to tier 2 cloud storage saves everyone a ton of money.”
Krishnan says customers have two major choices: ParaScale and EMC Atmos. ParaScale is software only, whereas EMC is both hardware and software. Krishnan points to parallels with VMware. A company such as Eli Lilly, for example, can buy ParaScale’s software and load it onto standard commodity servers. “The beauty of this thing is you’re able to use whatever hardware the companies have. Some are big with Dell, others HP; we’ll go with it. Many bioscience companies have large compute clusters that keep getting refreshed as bioscientists are looking for that edge in performance.” Rather than throw away those older servers, Krishnan says ParaScale can turn them into valuable, scalable storage, simply for the cost of the software.
“We’re in the business of enabling 100 Amazons... We sell our software to customers who want to get into the business of competing with Amazon.”
One of the virtues of cloud storage is reduced administrative costs. A user might manage a cluster of petabytes with half an FTE. “A ParaScale cloud might be composed of hundreds of servers, but it will feel like one appliance. It’s fully automated, whether servers die or you add new servers. The software has enough redundancy that there’s nothing other than scheduled downtime. Every Friday afternoon, your IT admin gets on his rollerblades and goes around the datacenter. He sees the racks that have red lights (failed disks, servers gone) and goes ahead and replaces them, and puts them back into the cloud. All the while users continue to access their data, as nothing is amiss in their view.”
It’s clear to Krishnan that these are early days. Access to hundreds of servers in a cloud provides huge parallel bandwidth, in addition to mere cost or capacity benefits. “This is something you don’t easily get at Amazon,” argues Krishnan. “This is what you get when you understand the cloud, and you have a private storage cloud.” Krishnan says ParaScale is in the early stages with a couple of big pharma companies, as well as a next-gen sequencing company.
Cloud storage is not necessary for cloud computing, but it’s good economic practice, says Krishnan. Cloud computing can save money, but the associated SAN storage is expensive. “Now you have a choice,” says Krishnan. “For the few Terabytes you actually need, use a SAN. Many vendors will call that ‘cloud storage,’ but that’s a few TB. For your other file data, be smart about it. Put it on the file type of cloud storage, like Amazon S3. Buying software from ParaScale allows you to do that. Now you can fully realize the savings you thought you were going to have when you went to cloud compute.”
Stanford Goes Private
ParaScale has built a private cloud compute environment for the Stanford Genome Technology Center. Until recently, the Stanford team scrapped next-gen sequencer images, even though archiving allows for re-analysis as new algorithms are developed. Stanford’s private cloud produces an easily managed, scalable storage pool. The sequencers now write directly to the cloud.
The Stanford staff built their cloud using older machines retired from the HPC cluster—a mélange of different hardware, memory and disk sizes. ParaScale’s cloud storage “has the scale and economy to handle our genomic data, and it is easy enough for our research scientists to manage,” says Baback Gharizadeh, research associate at the SGTC.
To add storage, the Stanford staff installs ParaScale software on any commodity hardware. The cloud detects the new machine and adds its storage to the cloud. As the cloud detects this additional storage, it starts replicating data. Using thin provisioning, the Stanford team can create file systems much larger than the physical storage space available in the cloud. Once they reach capacity, the cloud alerts the staff to add more storage nodes to the cloud.
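The thin-provisioning behavior described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not ParaScale's implementation: the `StoragePool` class, its method names, the node names, the capacities and the 80% alert threshold are all hypothetical. It shows a file system whose logical size exceeds the physical capacity of the pool, and an alert that clears once another node is added.

```python
from dataclasses import dataclass, field


@dataclass
class StoragePool:
    """Toy model of a storage pool built from commodity nodes (hypothetical)."""
    alert_threshold: float = 0.8  # assumption: alert at 80% physical use
    node_capacity_tb: dict = field(default_factory=dict)
    fs_logical_tb: dict = field(default_factory=dict)
    used_tb: float = 0.0

    def add_node(self, name: str, capacity_tb: float) -> None:
        # A newly detected machine contributes its disks to the pool.
        self.node_capacity_tb[name] = capacity_tb

    @property
    def physical_tb(self) -> float:
        return sum(self.node_capacity_tb.values())

    def create_filesystem(self, name: str, logical_tb: float) -> None:
        # Thin provisioning: the logical size is only a promise, so it may
        # exceed the physical capacity currently in the pool.
        self.fs_logical_tb[name] = logical_tb

    def write(self, tb: float) -> None:
        # Physical space is consumed only when data is actually written.
        if self.used_tb + tb > self.physical_tb:
            raise RuntimeError("pool is physically full; add storage nodes")
        self.used_tb += tb

    def needs_more_nodes(self) -> bool:
        # True once physical use reaches the alert threshold.
        if self.physical_tb == 0:
            return False
        return self.used_tb / self.physical_tb >= self.alert_threshold


pool = StoragePool()
pool.add_node("retired-hpc-01", 4.0)
pool.add_node("retired-hpc-02", 6.0)
pool.create_filesystem("sequencer-images", 100.0)  # 100 TB logical, 10 TB physical

pool.write(7.0)
print(pool.needs_more_nodes())  # False: 7 of 10 TB used
pool.write(1.5)
print(pool.needs_more_nodes())  # True: 8.5 of 10 TB used
pool.add_node("retired-hpc-03", 10.0)
print(pool.needs_more_nodes())  # False: 8.5 of 20 TB used
```

The design point is that the file system size and the physical pool size are tracked separately, so capacity can be added node by node without resizing anything the sequencers write to.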
The entire cloud can be administered from a single point of management, regardless of the size of the storage or the number of nodes. The cloud also supports policy-based replication, which ensures that enough copies of the data exist within the storage cloud. Once the analyzed data have been written to the cloud, they can be accessed using a simple web browser and shared with collaborators at other institutions.
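Policy-based replication can be pictured along the following lines. This is a hedged sketch, not the product's actual algorithm: the functions `place_replicas` and `repair`, the round-robin placement, and the file and node names are all assumptions made for illustration. The policy here is simply "keep N copies on distinct nodes", restored after a node failure.

```python
def place_replicas(files, nodes, copies):
    """Assign each file to `copies` distinct nodes, round-robin (hypothetical policy)."""
    if copies > len(nodes):
        raise ValueError("policy asks for more copies than there are nodes")
    placement = {}
    for i, name in enumerate(files):
        placement[name] = {nodes[(i + k) % len(nodes)] for k in range(copies)}
    return placement


def repair(placement, live_nodes, copies):
    """Forget replicas on failed nodes, then re-copy until the policy is met."""
    live = list(live_nodes)
    for name, holders in placement.items():
        holders &= set(live)  # in-place: drop copies held by dead nodes
        if not holders:
            raise RuntimeError(f"all copies of {name} were lost")
        for node in live:
            if len(holders) >= copies:
                break
            holders.add(node)  # stands in for copying from a surviving replica
    return placement


nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = place_replicas(["run1.img", "run2.img", "run3.img"], nodes, copies=2)

# Simulate the failure of node-b and let the policy restore two copies each.
survivors = [n for n in nodes if n != "node-b"]
placement = repair(placement, survivors, copies=2)
print(all(len(h) == 2 and "node-b" not in h for h in placement.values()))  # True
```

Because every file still has at least one surviving copy while the repair runs, readers of the data would see no interruption, which is the behavior the article attributes to the cloud.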
This article also appeared in the November-December 2009 issue of Bio-IT World Magazine.