Matrix Factorization for Gene Expression Recovery in scRNA-seq Data
Description
Single-cell RNA sequencing (scRNA-seq) is a powerful gene expression profiling technique that is revolutionizing the study of complex cellular systems and responses in the biological sciences. However, current scRNA-seq approaches suffer from sub-optimal target recovery, which leads to many false negatives and inaccurate measurements. The resulting inflation of null readings adds noise to data visualization and may confound interpretation. Because cells represent coherent phenotypes defined by conserved molecular circuitries, and because these circuitries are encoded in multi-gene expression patterns, information about one node in a multi-cell, multi-gene scRNA-seq data set is expected to be embedded in other nodes of the same data set. Under this hypothesis, several approaches have been proposed to impute missing values by extracting information from the non-zero values in the data set. We hypothesized that recommender systems could provide an effective means of recovering missing values in RNA-seq data, since they have been widely used to make predictions from sparse data matrices in other fields. In this study, we applied variations of non-negative matrix factorization (NMF) to produce predicted values for imputation. We compared these variations with existing imputation approaches and found that smooth NMF (sNMF) and weighted NMF (WNMF) produce significantly better results than the other approaches and may uncover hidden features in the data.
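To make the idea of imputation by weighted NMF concrete, the following is a minimal sketch and not the implementation used in this study. It assumes that every zero in a genes-by-cells matrix is treated as missing (weight 0) and every non-zero entry as observed (weight 1), and it uses the standard multiplicative updates for a masked Frobenius loss. The function name `weighted_nmf_impute` and its parameters (`rank`, `n_iter`, `eps`, `seed`) are hypothetical names chosen for illustration.

```python
import numpy as np


def weighted_nmf_impute(X, rank=2, n_iter=200, eps=1e-9, seed=0):
    """Fill zero entries of a non-negative matrix X using weighted NMF.

    Assumption (illustrative only): zeros are missing values with weight 0,
    non-zero entries are observed with weight 1. Real scRNA-seq data also
    contain true biological zeros, which this sketch does not distinguish.
    """
    X = np.asarray(X, dtype=float)
    mask = (X > 0).astype(float)           # 1 where observed, 0 where missing
    rng = np.random.default_rng(seed)
    n_genes, n_cells = X.shape

    # Strictly positive random initialization of the two factors.
    U = rng.random((n_genes, rank)) + 0.1  # gene loadings
    V = rng.random((rank, n_cells)) + 0.1  # cell loadings

    masked_X = mask * X
    losses = []
    for _ in range(n_iter):
        # Multiplicative updates that only "see" the observed entries.
        U *= (masked_X @ V.T) / ((mask * (U @ V)) @ V.T + eps)
        V *= (U.T @ masked_X) / (U.T @ (mask * (U @ V)) + eps)
        # Reconstruction error on the observed entries only.
        losses.append(float(np.sum(mask * (X - U @ V) ** 2)))

    X_hat = U @ V
    # Keep observed values; replace only the zeros with predictions.
    imputed = np.where(mask > 0, X, X_hat)
    return imputed, losses


if __name__ == "__main__":
    # Synthetic low-rank "expression" matrix with simulated dropouts.
    rng = np.random.default_rng(1)
    true = rng.random((6, 2)) @ rng.random((2, 8)) + 0.1
    observed = true.copy()
    observed[rng.random(true.shape) < 0.3] = 0.0
    imputed, losses = weighted_nmf_impute(observed, rank=2)
    print("loss after first and last iteration:", losses[0], losses[-1])
```

Masking the loss is what connects this sketch to the recommender-system setting: the factors are fitted only to entries that were actually measured, and the low-rank product then supplies predictions for the entries that were not. Choosing the rank and deciding which zeros are genuine dropouts are left open here.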