Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark

Thursday, May 23
2:00 PM - 2:40 PM
Marquis Salon 14

Whole-genome shotgun-based next-generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) of sequence data derived from tens of thousands of different genes or microbial species. De novo assembly of these data requires a solution that both scales with data size and optimizes for individual genes or genomes. We developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions reads based on their molecule of origin to enable downstream assembly optimization. SpaRC achieves high clustering performance on transcriptomics and metagenomics test datasets from both short-read and long-read sequencing technologies. It achieves near-linear scalability with respect to input data size and number of compute nodes. SpaRC can run on different cloud computing environments without modification while delivering similar performance. In summary, our results suggest that SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and that Apache Spark represents a cost-effective solution with rapid development and deployment cycles for similar big data genomics problems.
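To make the idea of partitioning reads by their molecule of origin concrete, here is a minimal single-machine sketch in plain Python. It is not SpaRC's implementation, and the abstract does not describe the actual algorithm. The sketch assumes that reads sharing at least one k-mer come from the same molecule, and it merges such reads with a union-find structure. The function names (`kmers`, `cluster_reads`) and the parameters `k` and `min_shared` are hypothetical, and reverse complements and sequencing errors are ignored for brevity.

```python
from collections import defaultdict


def kmers(read, k):
    """Return the set of all length-k substrings of a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def cluster_reads(reads, k=5, min_shared=1):
    """Group read indices whose reads share at least `min_shared` k-mers.

    Illustrative assumption: shared k-mers imply a common molecule of origin.
    """
    parent = list(range(len(reads)))

    def find(x):
        # Follow parent links to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller index as the root so output is deterministic.
            parent[max(ra, rb)] = min(ra, rb)

    # Inverted index: k-mer -> indices of the reads that contain it.
    index = defaultdict(list)
    for rid, read in enumerate(reads):
        for km in kmers(read, k):
            index[km].append(rid)

    # Count how many distinct k-mers each pair of reads has in common.
    shared = defaultdict(int)
    for ids in index.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                shared[(ids[i], ids[j])] += 1

    # Merge every pair that passes the overlap threshold.
    for (a, b), count in shared.items():
        if count >= min_shared:
            union(a, b)

    clusters = defaultdict(list)
    for rid in range(len(reads)):
        clusters[find(rid)].append(rid)
    return sorted(clusters.values())


reads = ["ACGTACGTAA", "GTACGTAACC", "TTTTGGGGCC", "TGGGGCCAAT"]
print(cluster_reads(reads, k=5))
```

The pairwise counting step is quadratic in the number of reads sharing a k-mer, so this version only suits toy inputs. At the scale the abstract describes, the same three stages (k-mer indexing, edge generation, connected grouping) would need to be distributed, which is the kind of workload a framework such as Apache Spark is designed for.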

Zhong Wang
Group Leader
DOE Joint Genome Institute
Dr. Zhong Wang is a career computational biologist and group leader for genome analysis at the DOE Joint Genome Institute (JGI); he is also an adjunct professor at the University of California, Merced. He received his Ph.D. in Cell Biology from Duke University in 2004. He completed his postdoctoral training in the Institute for Genome Sciences and Policy at Duke University before becoming a research scientist at Yale University in 2008. He joined the DOE Joint Genome Institute in 2009 and established his independent research in transcriptomics, metagenomics, and big data analytics. Dr. Wang has published over 30 papers, including several in Science and Nature. More information about his research can be found at