When you think about organizations that face big data analytics challenges, web companies like Facebook, Netflix and Google typically come to mind. You might also point to online retailers, which have access to huge stores of clickstream and customer data. Scientific research labs doing genomic data analysis aren't in the public eye as much, but they're increasingly in the thick of things with big data.
Genomic data -- information on human or animal genomes and the DNA they contain -- is growing at a rapid clip. That's pushing researchers who want to mine and analyze all that data to think about new data architectures, and some are finding that the Apache Spark processing engine and other big data technologies are a good fit for their work.
The first human genome took about a decade to sequence at a cost of nearly $3 billion. But, as the available methods improved, both the time and cost of sequencing DNA dropped precipitously. Today, genomic data analysis is a growing focus of scientific research, with much of the work aimed at finding new ways to treat diseases. Aided by such efforts, a number of treatments tailored to the specific genetic characteristics of patients are becoming available for medical conditions like cancer, heart disease and diabetes.
But all of that genomics activity is creating a huge data crunch. A 2015 research paper published in the journal PLOS Biology estimated that the amount of genomic data produced over the next 10 years would outpace the data volumes generated by astronomy-related organizations and by both YouTube and Twitter.
A clear need for data analytics speed
With so much data flooding in, "it's going to require innovations in computing to maintain our current pace in biomedicine," said Cotton Seed, a senior principal software engineer at Broad Institute, a collaborative research center in Cambridge, Mass., that was set up by MIT and Harvard in 2004.
For Seed, a lot of that innovation is happening in Spark. Speaking at Spark Summit East 2017 in Boston last week, he said he and his team built a genomic research platform on Spark that leverages the technology's SQL querying function and library of machine learning algorithms to speed up the data mining and analytics process.
Broad Institute is currently working on projects to map out genetic traits that tend to be associated with certain types of cancer and the genetic makeup of microorganisms that live in the human body, among other initiatives. Seed said Spark is useful in those efforts because it can connect to different data stores and lets researchers work with it in different languages -- SQL, Python or Scala, whichever most closely fits their work. When they're writing queries, "it's important that [researchers] be able to 'speak' as close as possible to the languages of biology," he said.
The speed with which Spark handles large data volumes and its scalability also make the platform attractive for genomic data analysis and data mining uses, said Zhong Wang, a computational biologist and genomics researcher at Lawrence Berkeley National Laboratory in Berkeley, Calif., during another presentation at the Spark conference.
Wang heads a research team that studies the genetic-level interactions between microorganisms in the guts of animals. The studies produce far too much data for anyone to mine and interpret manually in a spreadsheet, so the team uses Spark and machine learning algorithms to parse the data and identify meaningful correlations.
Spark adds more processing power
Prior to adopting Spark, Wang and his colleagues in 2009 deployed a six-server Hadoop cluster to run their analyses, using the Apache Pig scripting and analysis platform. But processing times were slow, he said. Also, the researchers were trying to build graph-based algorithms, which weren't very compatible with a MapReduce-based programming environment like Pig.
A couple of years later, the team switched to Spark running against data stored in Amazon EMR, a cloud-based Hadoop distribution from Amazon Web Services that was formerly known as Elastic MapReduce. Wang said the Spark system has improved processing times, even as the amount of mined data moving through the platform continues to grow.
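Wang's point about graph-based algorithms and MapReduce can be illustrated with a small sketch. The article doesn't say which algorithms his team built, so the example below assumes a simple one: finding connected components by label propagation. It is written in plain Python rather than Pig or Spark, and every name in it is hypothetical. Each pass over the edges resembles one map-and-reduce job, and the passes have to repeat until nothing changes. In a MapReduce-style environment, each repetition is typically a separate job that writes its results out and reads them back in, while Spark can keep the working data in memory between passes.

```python
from collections import defaultdict


def propagate_once(labels, edges):
    """Run one map-and-reduce style pass over the graph (illustrative only)."""
    # "Map" step: every edge sends each endpoint's current label to the other endpoint.
    candidates = defaultdict(list)
    for a, b in edges:
        candidates[a].append(labels[b])
        candidates[b].append(labels[a])
    # "Reduce" step: each node keeps the smallest label it has seen, including its own.
    return {node: min([label] + candidates[node]) for node, label in labels.items()}


def connected_components(nodes, edges):
    """Repeat passes until labels stop changing; return the labels and the pass count."""
    labels = {node: node for node in nodes}  # every node starts in its own component
    passes = 0
    while True:
        new_labels = propagate_once(labels, edges)
        passes += 1
        if new_labels == labels:  # converged: no label changed in this pass
            return labels, passes
        labels = new_labels


if __name__ == "__main__":
    # Hypothetical toy graph: a-b-c form one group, d-e form another.
    nodes = ["a", "b", "c", "d", "e"]
    edges = [("a", "b"), ("b", "c"), ("d", "e")]
    labels, passes = connected_components(nodes, edges)
    print(labels, passes)
```

Even this five-node graph needs several passes before the labels settle, and the number of passes grows with the longest chain of connections in the data. That repeated-pass pattern is the kind of workload that is awkward to express as a series of independent batch jobs.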
Like Seed, Wang said the ability to write applications for Spark in a variety of fairly easy-to-learn languages is another plus. It means researchers like him can do most of the development work needed for genomic data analysis projects, rather than having to rely on data engineers or data scientists. "I'm not trained as a computer scientist, but I can write Scala and Python Spark applications," Wang said. "It's not possible to hire an expensive engineer just to do this [for us]."