Introduction
Despite the rapid development of medical technology, the number of new cancer cases and deaths keeps rising, and cancer remains a serious threat to human health and an important cause of death. The latest estimates from the International Agency for Research on Cancer (IARC, 2021) show 19.3 million new cancer cases and 10 million cancer deaths worldwide in 2020. Cancer is expected to surpass cardiovascular disease as the main cause of premature death in most countries during this century. The rapid development of high-throughput technologies such as deep sequencing has produced massive amounts of biological data, which helps to characterize human diseases more precisely and to personalize treatment. In oncology, analyses of high-throughput biological data sets have revealed new cancer subtypes, which have been used to guide cancer treatment decisions (Parker et al., 2009; Prasad et al., 2016).
Machine learning is widely used to analyze bioinformatics data and can support doctors in decision-making and treatment planning (Amin et al., 2021; Kumar-Sinha & Chinnaiyan, 2018; Rajinikanth & Kadry, 2021). To improve cancer diagnosis and treatment, genomic and other molecular profiles of tumor biopsies have been analyzed for precision tumor therapy. For example, a coclustering algorithm that incorporates gene network interactions has been proposed for identifying cancer subtypes (Liu et al., 2014). However, the role of the human genome is complex, and it regulates biological processes at several levels. A fuller picture can be obtained by integrating several types of genomic data, such as gene expression, copy number variation, and DNA methylation (Huang et al., 2017). Modern genomic and clinical research therefore urgently needs machine learning models that integrate multiomics data, so that large amounts of heterogeneous information can be used to understand biological systems in depth. Multiomics data capture information from different perspectives and levels, which helps in understanding complex biological systems (Li et al., 2016). The integration and clustering of multiomics data is one of the research hotspots of machine learning in bioinformatics.
To exploit both the local geometrical structures and the global structures of bioinformatics data, a novel multiview clustering method based on nonnegative matrix factorization (NMF), Multi-GSNMF, is proposed for cancer subtyping. The local geometrical structure of each omics data set is encoded by a nearest neighbor graph. The global structure of the multiomics data set is captured by sparsity-regularized constraints. A unified objective function is then formed by incorporating the local geometrical structure of each omics data set and a sparsity-regularized common consensus matrix into the NMF-based framework. The method obtains a common consensus representation of the multiomics data set, while the sparsity constraints handle the noise and outliers in bioinformatics data. Figure 1 illustrates the framework, in which multiview NMF with graph regularization and sparsity constraints is integrated into a unified model. The final clustering results are obtained by spectral clustering. The main contributions are as follows:
1. A unified framework for cancer subtyping that accounts for the characteristics of cancer data sets is proposed. In precision medicine, it helps to identify cancer subtypes that would otherwise be obscured by noise and outliers in bioinformatics data.
2. The local geometrical structures and sparsity constraints are incorporated into the multiview clustering process to form a unified objective function for cancer subtyping based on nonnegative matrix factorization.
3. By incorporating the local geometrical structure of each omics data set and the sparsity constraints on a common consensus matrix into the clustering process, Multi-GSNMF provides a unified model and a novel solution for fusing multiview data for clustering.
Figure 1. Framework of the Proposed Algorithm
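To make the two building blocks named above more concrete, the following is a minimal single-view sketch in Python: a nearest neighbor graph that encodes local geometry, followed by graph-regularized NMF. It is not the Multi-GSNMF objective, which this introduction does not spell out. It uses the standard single-view graph-regularized NMF multiplicative updates as a stand-in and omits the common consensus matrix, the sparsity term, and the final spectral clustering step. The function names (`knn_affinity`, `graph_regularized_nmf`, `gnmf_objective`), the binary edge weights, the Euclidean distance, and the parameters `k` and `lam` are all assumptions made for illustration.

```python
import numpy as np


def knn_affinity(X, k=5):
    """Symmetric 0/1 k-nearest-neighbor affinity over the rows (samples) of X.

    Assumes k < number of samples; binary weights are an illustrative choice.
    """
    n = X.shape[0]
    # Pairwise squared Euclidean distances between samples.
    sq = np.sum(X ** 2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(dist, np.inf)  # a sample is never its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :k]
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), nearest.ravel()] = 1.0
    # Keep an edge if either endpoint lists the other as a neighbor.
    return np.maximum(W, W.T)


def gnmf_objective(X, W, V, H, lam):
    """||X - V H||_F^2 + lam * tr(V^T L V), with graph Laplacian L = D - W."""
    L = np.diag(W.sum(axis=1)) - W
    return np.linalg.norm(X - V @ H) ** 2 + lam * np.sum(V * (L @ V))


def graph_regularized_nmf(X, W, rank, lam=1.0, n_iter=200, eps=1e-10, seed=0):
    """Single-view graph-regularized NMF via multiplicative updates.

    X: nonnegative (samples x features) matrix, approximated as V @ H.
    V: (samples x rank) sample representation, smoothed along the graph W.
    H: (rank x features) basis matrix.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    V = rng.random((n, rank))
    H = rng.random((rank, m))
    d = W.sum(axis=1)  # node degrees, i.e. the diagonal of D
    history = [gnmf_objective(X, W, V, H, lam)]
    for _ in range(n_iter):
        # Basis update: plain NMF step, the graph term does not involve H.
        H *= (V.T @ X) / (V.T @ V @ H + eps)
        # Representation update: W V pulls neighbors together, D V balances it.
        numer = X @ H.T + lam * (W @ V)
        denom = V @ (H @ H.T) + lam * d[:, None] * V + eps
        V *= numer / denom
        history.append(gnmf_objective(X, W, V, H, lam))
    return V, H, history
```

For one omics matrix `X` with samples in rows, `graph_regularized_nmf(X, knn_affinity(X, k=5), rank=3)` returns a nonnegative sample representation `V`, a basis `H`, and the objective value after each iteration. A multiview variant along the lines described in this section would run one such factorization per omics data set and couple the per-view representations through a shared consensus matrix.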