Introduction
Bioinformatics research on long non-coding RNAs (lncRNAs) has attracted much attention in academia and industry because of the important role lncRNAs play in gene expression. lncRNAs are defined as transcripts longer than 200 nt with limited protein-coding potential. They account for a large part of the non-coding portion of human DNA, which makes up over 90% of the whole genome. Furthermore, recent studies have shown that lncRNAs are involved in pathophysiology in various ways, e.g., through gene expression, transcription, and post-translational processing.
Early lncRNA bioinformatics research focused mainly on sequence acquisition and data collection, e.g., tools to collect and annotate lncRNAs. However, as understanding of the datasets has deepened, more and more research has shifted to data analysis and application. For instance, under hypotheses such as “Expression-related genes have a relevant function” and “Interacting molecules have a relevant function,” it is possible to evaluate the similarity between different lncRNAs and thus predict the relationship between an lncRNA and a corresponding disease. However, several technical challenges remain open:
1. Feature Extraction Challenge: High-throughput genomics data have specific characteristics, e.g., high dimensionality and a lack of labeled samples. Therefore, the key technical challenge is to explore the data distribution, characteristic patterns, and potential relationships based on prior knowledge from a few labeled samples.
2. Computation Challenge: Given the huge amount of genomics data, the design of computation models, especially lightweight, intelligent, and efficient algorithms, is an urgent open problem.
3. Transfer Learning Challenge: When the data distribution changes (for instance, when gene expression data shift from one species to another), seamlessly transferring a previously trained model to the new domain is another technical challenge.
This paper investigates lncRNA-related issues and proposes a generic, lightweight, intelligent, and efficient computing model to predict key lncRNAs related to disease pathophysiology. Our work makes three contributions:
1. We propose a Binary PSO-based algorithm for selecting candidate lncRNA subsets based on extracted features and logical connections. As a result, an optimal lncRNA subset can be obtained through repeated, iterative optimization.
2. An ELM-based classification model is introduced and implemented to evaluate each lncRNA’s influence on disease. The evaluation result is used to guide subsequent selection preferences.
3. We selected three datasets for experiment and evaluation: breast invasive carcinoma, carcinoma of the colon, and lung adenocarcinoma. The results show that our proposed solution achieves 93.6% classification accuracy, the best result in the evaluation.
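To make the first two contributions concrete, the following is a minimal sketch of how a binary PSO could search over feature subsets while a classifier's validation accuracy guides the search. It rests on several assumptions that are not taken from the paper: a standard sigmoid-transfer binary PSO, a basic single-hidden-layer ELM rather than the multilayer ML-ELM, synthetic data in place of lncRNA expression profiles, and hypothetical names and parameter values (`elm_fit`, `elm_predict`, `fitness`, `bpso_select`, `w`, `c1`, `c2`, the velocity clip of 4).

```python
import numpy as np

def elm_fit(X, y, n_hidden, rng):
    """Basic ELM: random hidden layer, output weights solved by least squares."""
    W = rng.normal(size=(X.shape[1], n_hidden))   # random input weights (never trained)
    b = rng.normal(size=n_hidden)                 # random hidden biases
    H = np.tanh(X @ W + b)                        # hidden-layer activations
    T = np.eye(int(y.max()) + 1)[y]               # one-hot class targets
    beta = np.linalg.pinv(H) @ T                  # Moore-Penrose solution for output weights
    return W, b, beta

def elm_predict(model, X):
    W, b, beta = model
    return np.argmax(np.tanh(X @ W + b) @ beta, axis=1)

def fitness(mask, X_tr, y_tr, X_va, y_va, n_hidden, seed):
    """Validation accuracy of an ELM trained only on the selected features."""
    if not mask.any():                            # an empty subset cannot classify
        return 0.0
    rng = np.random.default_rng(seed)             # fixed seed keeps fitness deterministic
    model = elm_fit(X_tr[:, mask], y_tr, n_hidden, rng)
    return float(np.mean(elm_predict(model, X_va[:, mask]) == y_va))

def bpso_select(X_tr, y_tr, X_va, y_va, n_particles=10, n_iter=15,
                n_hidden=20, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Binary PSO over 0/1 feature masks; returns (best mask, its fitness)."""
    rng = np.random.default_rng(seed)
    d = X_tr.shape[1]
    score = lambda m: fitness(m.astype(bool), X_tr, y_tr, X_va, y_va, n_hidden, seed)
    pos = (rng.random((n_particles, d)) < 0.5).astype(float)   # 1.0 = feature selected
    vel = rng.uniform(-1, 1, size=(n_particles, d))
    pbest, pbest_fit = pos.copy(), np.array([score(m) for m in pos])
    g = int(np.argmax(pbest_fit))
    gbest, gbest_fit = pbest[g].copy(), pbest_fit[g]
    for _ in range(n_iter):
        r1, r2 = rng.random((n_particles, d)), rng.random((n_particles, d))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        vel = np.clip(vel, -4, 4)                               # keep sigmoid away from 0/1
        prob = 1.0 / (1.0 + np.exp(-vel))                       # sigmoid transfer function
        pos = (rng.random((n_particles, d)) < prob).astype(float)
        fit = np.array([score(m) for m in pos])
        improved = fit > pbest_fit                              # update personal bests
        pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
        g = int(np.argmax(pbest_fit))
        if pbest_fit[g] > gbest_fit:                            # update global best
            gbest, gbest_fit = pbest[g].copy(), pbest_fit[g]
    return gbest.astype(bool), float(gbest_fit)

# Synthetic stand-in data: 30 features, only the first two carry the label.
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 30))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
mask, acc = bpso_select(X[:100], y[:100], X[100:], y[100:])
print(mask.sum(), "features selected, validation accuracy", round(acc, 3))
```

In this sketch the classifier's accuracy is the only feedback the swarm receives, which mirrors the idea that the evaluation result guides later selection. A real study would likely use cross-validation rather than a single held-out split to reduce overfitting of the selected subset.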
The rest of the paper is organized as follows. We first present the relationship between lncRNA and disease (especially cancer) and then describe existing machine learning-based lncRNA research. Next, we introduce lncRNA data collection, preprocessing, and noise filtering. The following section introduces the proposed BPSO-ML-ELM solution for lncRNA function prediction. We then describe the experiments and evaluation and discuss the proposed solution on three datasets. Finally, we present conclusions and suggest future work.