An Approach to DNA Sequence Classification Through Machine Learning: DNA Sequencing, K Mer Counting, Thresholding, Sequence Analysis

Sapna Juneja, Annu Dhankhar, Abhinav Juneja, Shivani Bali
Copyright: © 2022 |Pages: 15
DOI: 10.4018/IJRQEH.299963

Abstract

Machine learning (ML) has been instrumental in supporting decision making from relevant historical data, including in the domain of bioinformatics. In bioinformatics, distinguishing natural genes from genes affected by disease, called invalid genes, is a very complex task. To determine the applicability of a new protein through genomic research, DNA sequences need to be classified. The current work identifies classes of DNA sequences using a machine learning algorithm. These classes depend on the sequence of nucleotides: with even a small mutation in the sequence, there is a corresponding change in the class. Each numeric class label is linked to a gene family, such as G protein-coupled receptors, tyrosine kinase, or synthase. In this paper, the authors applied the classification algorithm to three datasets to identify which gene class each sequence belongs to. They converted the sequences into substrings of a defined length; this 'k value' sets the substring length, and splitting a sequence in this way is one of the ways to analyze it.
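
The substring step described in the abstract can be illustrated with a minimal Python sketch. It assumes overlapping substrings taken with a stride of one base, which is a common k-mer convention but is not stated in the abstract; the function names `kmers` and `kmer_counts` and the example sequence are hypothetical, not taken from the paper.

```python
from collections import Counter


def kmers(sequence: str, k: int) -> list[str]:
    """Return every overlapping substring of length k (stride of one base).

    Overlap is an assumption of this sketch, not a detail from the paper.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    sequence = sequence.upper()
    # A sequence of length n yields n - k + 1 substrings (none if n < k).
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count how often each k-mer occurs in the sequence."""
    return Counter(kmers(sequence, k))


if __name__ == "__main__":
    example = "ATGCATGC"  # made-up sequence, for illustration only
    print(kmers(example, 3))
    print(kmer_counts(example, 3))
```

Counts of this kind can serve as numeric features for a classifier; a larger k gives more specific substrings but a sparser feature space.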

Introduction

1.1 Background

DNA comprises two chains of nucleotides coiled around each other, joined by hydrogen bonds and running in opposite directions, which gives it its double-helix structure (Chou & Shen, 2006). Each chain is built from four nucleotides: adenine (A), cytosine (C), guanine (G) and thymine (T) (Akhtar et al., 2008), (Akhtar et al., 2008), (Akhtar et al., 2007), (Ramachandran et al., 2012), with an A on one chain always paired with a T on the other, and a C always paired with a G (Kinsner, 2010). The structure of DNA was discovered by James Watson and Francis Crick. In the biomedical sciences, gene expression analysis is employed to determine human disease structure (Kirk et al., 2018), (Phongwattana et al., 2015). DNA sequencing is the process of determining the nucleic acid sequence, that is, the physical order of these bases in DNA. There are a number of techniques for identifying the order of the four bases. Traditionally, most biologists have used machine learning (ML) models for problems such as functional genomics, gene-phenotype associations, gene signatures and gene interactions. Earlier research identified genes through experiments on living cells, an accurate but costly approach. In contrast, present-day work uses machine-based approaches to identify genes because of the accuracy these methods offer. Machine-based approaches to gene prediction can be categorized as content-based approaches and similarity-based approaches (Wang et al., 2004). Similarity-based approaches search for similarity between a candidate sequence and public databases of known genes. They are computationally costly and miss novel genes. Content-based approaches are a more advanced gene-prediction technique that overcomes the limitations of similarity-based techniques.
These approaches use several attributes of sequences, such as codon usage, sequence length and GC content. They then employ supervised learning or statistical approaches to predict whether a read contains any genes. ML has proven effective for various problem types, such as classification, regression and clustering.
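
The sequence attributes named above (codon use, sequence length and GC content) can be computed with a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' feature pipeline: codons are read as non-overlapping triplets from the first base (reading frame 0), an incomplete trailing triplet is dropped, and the function names and example sequence are hypothetical.

```python
from collections import Counter


def gc_content(sequence: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def codon_usage(sequence: str) -> Counter:
    """Count non-overlapping triplets starting at the first base.

    Assumes reading frame 0; a trailing fragment shorter than 3 is ignored.
    """
    sequence = sequence.upper()
    codons = [sequence[i:i + 3] for i in range(0, len(sequence) - 2, 3)]
    return Counter(codons)


def content_features(sequence: str) -> dict:
    """Bundle the three attributes into one feature record."""
    return {
        "length": len(sequence),
        "gc_content": gc_content(sequence),
        "codon_usage": codon_usage(sequence),
    }


if __name__ == "__main__":
    print(content_features("ATGATGCCC"))  # made-up sequence, for illustration
```

A record like this, computed per read, is the kind of input a supervised classifier could consume once the codon counts are flattened into a fixed-length vector.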
