Electronic Theses & Dissertations

Using Apache Spark for Scalable Gene Sequence Analysis

Muthahar Syed

Document Type

Thesis

Degree Name

Master of Science (MS)

Department

Computer Science and Information Systems

Date of Award

Spring 2016

Abstract

Advances in technology have made it possible to digitize genetic information, producing an enormous volume of genetic sequences. Genetic sequences contain the details of human DNA, and analyzing sequencing data at this scale is a primary challenge. This thesis introduces a scalable genome sequence analysis system that uses the parallel computing features of Apache Spark and its relational processing module, Spark Structured Query Language (Spark SQL). The Spark framework supports efficient data reuse by holding data in memory, which significantly reduces data access time and thus improves performance. To demonstrate the scalability of the proposed system, the experiments are run on a Spark parallel computing cluster deployed on top of Yet Another Resource Negotiator (YARN). The experiments detailed in this thesis use publicly available 1000 Genomes Project Variant Call Format (VCF) data (1.2 TB) as input. The input data are analyzed using Spark, and the results are evaluated to measure the scalability and performance of the system. I further implemented a web-based interface where users can specify search criteria; Spark SQL then performs the search on the data held in memory and returns the matching records.
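The search workflow described in the abstract (load variant records once, keep them in memory, then answer repeated relational queries) can be illustrated with a minimal sketch. This is not the thesis implementation and does not use Spark: it substitutes Python's standard-library `sqlite3` in-memory database for Spark SQL so that it runs on a single machine without a cluster. The sample records, the `variants` table, and the helper names `parse_vcf`, `load_variants`, and `search` are hypothetical and chosen only for illustration; only the fixed VCF column order (CHROM, POS, ID, REF, ALT, ...) is taken from the VCF format itself.

```python
import sqlite3

# Hypothetical sample in VCF layout; the variant values are made up.
SAMPLE_VCF = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t10177\trs0000001\tA\tAC\t100\tPASS\tAF=0.42\n"
    "1\t10352\trs0000002\tT\tTA\t100\tPASS\tAF=0.43\n"
    "2\t10616\trs0000003\tC\tG\t100\tPASS\tAF=0.01\n"
)


def parse_vcf(text):
    """Yield (chrom, pos, id, ref, alt) tuples, skipping '#' header lines."""
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        chrom, pos, vid, ref, alt = line.split("\t")[:5]
        yield chrom, int(pos), vid, ref, alt


def load_variants(text):
    """Load parsed records into an in-memory table (stand-in for a cached Spark table)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE variants "
        "(chrom TEXT, pos INTEGER, id TEXT, ref TEXT, alt TEXT)"
    )
    conn.executemany(
        "INSERT INTO variants VALUES (?, ?, ?, ?, ?)", parse_vcf(text)
    )
    return conn


def search(conn, chrom, start, end):
    """Return variants on one chromosome within a position range, ordered by position."""
    cur = conn.execute(
        "SELECT id, pos, ref, alt FROM variants "
        "WHERE chrom = ? AND pos BETWEEN ? AND ? ORDER BY pos",
        (chrom, start, end),
    )
    return cur.fetchall()


# Load once, then reuse the in-memory table for every query.
conn = load_variants(SAMPLE_VCF)
print(search(conn, "1", 10000, 10200))
```

In this sketch the data are parsed once and each later call to `search` reads only from memory, which mirrors the data reuse idea the abstract attributes to Spark; the distributed execution and YARN deployment are outside what this single-process example shows.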

Advisor

Jinoh Kim

Subject Categories

Computer Sciences | Physical Sciences and Mathematics

Recommended Citation

Syed, Muthahar, "Using Apache Spark for Scalable Gene Sequence Analysis" (2016). Electronic Theses & Dissertations. 997.
https://lair.etamu.edu/etd/997
