Using Apache Spark for Scalable Gene Sequence Analysis


Muthahar Syed

Document Type


Degree Name

Master of Science (MS)


Computer Science and Info Sys

Date of Award

Spring 2016


Advances in technology have made it possible to digitize genetic information, producing an enormous volume of genetic sequences. These sequences carry the details of human DNA, and analyzing sequencing data at this scale is the primary concern. This thesis introduces a scalable genome sequence analysis system that uses the parallel computing features of Apache Spark and its relational processing module, Spark Structured Query Language (Spark SQL). Spark supports efficient data reuse by holding data in memory, which significantly reduces data access time and thus improves performance. To demonstrate the scalability of the proposed system, the experiments run on a Spark parallel computing cluster deployed on top of Yet Another Resource Negotiator (YARN). They use publicly available 1000 Genomes Variant Call Format (VCF) data (1.2 TB) as input. The input data are analyzed with Spark, and the results are evaluated to measure the scalability and performance of the system. I further implemented a web-based interface in which users specify search criteria; Spark SQL then runs the search against the data held in memory and returns the results.
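
To make the search step concrete, the following is a minimal sketch in plain Python of parsing VCF-style data lines and filtering them by a user-specified criterion. It is not the thesis implementation and does not use Spark: in the system described above, the equivalent filter would be expressed as a Spark SQL query over data cached in memory. The names `Variant`, `parse_vcf_lines`, and `search` are hypothetical, the sample records are made up for illustration, and only the first five fixed VCF columns (CHROM, POS, ID, REF, ALT) are read.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Variant(NamedTuple):
    """Subset of the fixed VCF columns (hypothetical record type)."""
    chrom: str
    pos: int
    vid: str
    ref: str
    alt: str


def parse_vcf_lines(lines: Iterable[str]) -> Iterator[Variant]:
    """Yield one Variant per data line, skipping blank and '#' header lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        # VCF data lines are tab-separated; the first five columns are
        # CHROM, POS, ID, REF, ALT.
        chrom, pos, vid, ref, alt = line.split("\t")[:5]
        yield Variant(chrom, int(pos), vid, ref, alt)


def search(variants: Iterable[Variant], chrom: str, start: int, end: int) -> List[Variant]:
    """Return variants on `chrom` whose position lies in [start, end]."""
    return [v for v in variants if v.chrom == chrom and start <= v.pos <= end]


# Made-up sample records, for illustration only (not real 1000 Genomes data).
SAMPLE = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t10177\tvar_a\tA\tAC\t100\tPASS\t.",
    "1\t20500\tvar_b\tG\tT\t100\tPASS\t.",
    "2\t10177\tvar_c\tC\tG\t100\tPASS\t.",
]

hits = search(parse_vcf_lines(SAMPLE), chrom="1", start=10000, end=15000)
for v in hits:
    print(v.chrom, v.pos, v.vid, v.ref, v.alt)
```

The search criterion here is a chromosome plus a position range, chosen only as an example of what a user might enter in the web interface; the abstract does not specify which criteria the interface supports.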


Jinoh Kim

Subject Categories

Computer Sciences | Physical Sciences and Mathematics