Title

Identification of Protein Complexes Using Machine Learning (PyBrain and Scikit-Learn) Based on DNA Sequence Data

Document Type

Thesis

Degree Name

Master of Science (MS)

Department

Computer Science and Info Sys

Date of Award

Fall 2014

Abstract

The National Center for Biotechnology Information (NCBI) provides a range of science- and health-related resources, such as program downloads, databases, submissions, tools, and protocols. The database section is divided into many subsections, such as BioProject (formerly Genome Project), BioSample, Bookshelf, GenBank, and the Nucleotide database. This study focused on the Nucleotide (DNA sequence) database. With data sets of this size, researchers can apply various theories and algorithms to extract and mine knowledge. This study applied machine learning approaches to the classification and identification of protein complexes using DNA sequences as input.

Machine learning is the construction of artificially intelligent systems that take in and analyze data in order to improve themselves and interact better with the data. In effect, the system learns to do its job better. Machine learning is among the most useful tools for research and an excellent example of learning from examples (Love, 2014). This study used the Python-Based Reinforcement Learning, Artificial Intelligence and Neural Network Library (PyBrain) as a modular machine learning library, with supervised learning algorithms as the underlying algorithms of the models.

In this study, the data were separated into two sets. The first set was used to train models built on different algorithms. The second set was used to test the accuracy of those models. Each model was built with specific parameters, then trained and tested on the datasets. The results were visualized with line charts and clustered column charts.

Across six types of protein complex datasets, the model with the best accuracy and learning rate was the resilient propagation model, which reached 99.98% accuracy and learned faster than the others. The accuracy of the back propagation model was 97.76%. The accuracy of the support vector machine models was 93.60%. The accuracy of the stochastic gradient descent model was 98.38%.
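The train/test procedure described in the abstract can be illustrated with a minimal sketch. It is not the thesis's code: the one-hot encoding of bases, the 30% test fraction, and the function names `one_hot`, `split_train_test`, and `accuracy` are all assumptions made for illustration, and it uses only the Python standard library rather than PyBrain or scikit-learn.

```python
import random

BASES = "ACGT"  # assumed alphabet; any other symbol (e.g. N) encodes as all zeros


def one_hot(sequence):
    """Encode a DNA string as a flat list of 0/1 values, four per base."""
    vector = []
    for base in sequence.upper():
        vector.extend(1 if base == b else 0 for b in BASES)
    return vector


def split_train_test(samples, test_fraction=0.3, seed=0):
    """Shuffle a copy of the samples and return (train, test) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical toy data: (sequence, class label) pairs, not from the thesis.
toy_samples = [("ACGT", 0), ("TTGA", 1), ("ACGA", 0), ("TTGC", 1), ("ACGG", 0)]
train_set, test_set = split_train_test(toy_samples)
```

A classifier would be fitted on the encoded sequences in `train_set` only, and `accuracy` would then be computed on its predictions for `test_set`, which the model has not seen during training.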

Advisor

Suh C. Sang

Subject Categories

Computer Sciences | Physical Sciences and Mathematics
