Using Apache Spark for Scalable Gene Sequence Analysis
Document Type
Thesis
Degree Name
Master of Science (MS)
Department
Computer Science and Information Systems
Date of Award
Spring 2016
Abstract
Advances in sequencing technology have made it possible to digitize genetic information, producing very large volumes of genetic sequence data. These sequences encode details of human DNA, and analyzing them at scale is a major challenge. This thesis introduces a scalable genome sequence analysis system that uses the parallel computing features of Apache Spark and its relational processing module, Spark SQL (Spark Structured Query Language). Spark supports efficient data reuse by holding data in memory, which significantly reduces data access time and thus improves performance. To demonstrate the scalability of the proposed system, the experiments run on a Spark parallel computing cluster deployed on top of Yet Another Resource Negotiator (YARN). The experiments detailed in this thesis use publicly available 1000 Genomes Project data in Variant Call Format (VCF), 1.2 TB in size, as input. The input data are analyzed using Spark, and the results are evaluated to measure the scalability and performance of the system. I also implemented a web-based interface in which users specify search criteria; Spark SQL then performs the search on the data held in memory and returns the matching results.
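To make the search workflow described in the abstract more concrete, the following is a minimal, single-machine sketch and not the thesis implementation. It assumes only the eight fixed, tab-delimited VCF columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) and uses Python's built-in in-memory SQLite database as a stand-in for a cached Spark SQL table. The function names, the table name `variants`, and the sample records are hypothetical and chosen purely for illustration.

```python
import sqlite3

# Made-up sample lines in VCF layout (illustrative only, not 1000 Genomes data).
SAMPLE_LINES = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t10177\tvar_a\tA\tAC\t100\tPASS\tAC=2130",
    "1\t10505\tvar_b\tA\tT\t100\tPASS\tAC=1",
    "2\t10179\tvar_c\tC\tT\t100\tPASS\tAC=5",
]


def parse_vcf_line(line):
    """Split one VCF data line into its eight fixed fields; POS becomes an int."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
    return {
        "chrom": chrom, "pos": int(pos), "id": vid, "ref": ref,
        "alt": alt, "qual": qual, "filter": flt, "info": info,
    }


def load_variants(lines):
    """Load VCF data lines into an in-memory table.

    Assumption: SQLite's ':memory:' database stands in for a Spark SQL
    table that has been cached in cluster memory.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE variants (chrom TEXT, pos INTEGER, id TEXT, ref TEXT, "
        "alt TEXT, qual TEXT, filter TEXT, info TEXT)"
    )
    rows = []
    for line in lines:
        # Skip blank lines, '##' meta lines, and the '#CHROM' header line.
        if not line.strip() or line.startswith("#"):
            continue
        r = parse_vcf_line(line)
        rows.append((r["chrom"], r["pos"], r["id"], r["ref"],
                     r["alt"], r["qual"], r["filter"], r["info"]))
    conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)
    return conn


def search(conn, chrom, start, end):
    """Return variants on one chromosome within an inclusive position range."""
    cur = conn.execute(
        "SELECT chrom, pos, id, ref, alt FROM variants "
        "WHERE chrom = ? AND pos BETWEEN ? AND ? ORDER BY pos",
        (chrom, start, end),
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = load_variants(SAMPLE_LINES)
    for row in search(conn, "1", 10000, 10200):
        print(row)
```

In a Spark-based system, the same query shape (filter by chromosome and position range) would run against a distributed, memory-resident table, so repeated searches avoid rereading the 1.2 TB input from disk.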
Advisor
Jinoh Kim
Subject Categories
Computer Sciences | Physical Sciences and Mathematics
Recommended Citation
Syed, Muthahar, "Using Apache Spark for Scalable Gene Sequence Analysis" (2016). Electronic Theses & Dissertations. 997.
https://digitalcommons.tamuc.edu/etd/997