Our main research interests are in machine learning and bioinformatics. The overarching goal is to develop novel computational methods for advancing biological discoveries. Key words for our research include machine learning, big data, manifold learning, data integration, genomics, and single-cell analytics. The following are brief descriptions of our ongoing projects:
- Single-cell RNA sequencing (scRNA-seq) is a powerful technology capable of unveiling cellular heterogeneity of the transcriptome at single-cell resolution, producing insights into subpopulation structures and progression trajectories that would be hidden in bulk cell population RNA sequencing analyses. scRNA-seq experiments often generate large amounts of data, containing whole-genome gene expression measurements of thousands or more individual cells. Computational analysis of scRNA-seq data is challenging for several reasons, including high dimensionality, measurement noise, detection limits, and unbalanced sizes between rare and abundant populations. One important characteristic is that scRNA-seq data is highly sparse, a phenomenon often referred to as “dropout”. The excessive zero counts cause the data to be zero-inflated, capturing only a small fraction of the transcriptome of each cell. Many algorithms have been developed to analyze scRNA-seq data and computationally tackle the dropout problem. In contrast to the majority of the literature, which treats dropouts as a problem that needs to be fixed, we are developing a co-occurrence clustering framework that embraces dropouts as a useful signal for clustering cells. A preprint of this work is available: co-occurrence clustering bioRxiv manuscript.
- The Cancer Genome Atlas (TCGA) is a valuable data resource for cancer research. TCGA provides multi-omics and clinical data for ~11,000 patients across 33 cancer types. The omics data include mutation, copy number, gene expression, microRNA, methylation, protein expression, etc. The clinical data include cancer stage, survival, drug treatment, and many other clinical features. We have been working on integrating omics, survival, and drug treatment data to identify gene-drug interactions (example). We are also examining the correlations among the various omics data types (example).
- Flow cytometry and the next-generation mass cytometry technologies capture the heterogeneity of biological systems by providing multiparametric measurements of single cells. Even as cytometry technology is rapidly advancing, methods for analyzing this complex data lag behind. Traditional flow cytometry analysis is often a subjective and labor-intensive process that requires the user to have a deep understanding of the cellular phenotypes underlying the data. Furthermore, the advent of mass cytometry is quickly increasing the dimensionality of the data, making the traditional analysis approaches a critical bottleneck. We developed a novel analytical approach, Spanning-tree Progression Analysis of Density-normalized Events (SPADE), to objectively analyze single-cell data in a robust and unsupervised manner. Briefly, SPADE views a single-cell cytometric dataset as a high-dimensional point cloud of cells, and uses topological methods to reveal the geometry of the cloud. Based on preliminary data, this geometry reveals distinct subpopulations of cells and a likely cellular hierarchy underlying the data. (Click here to download the SPADE software).
- Mathematical modeling is an important tool for understanding complex biological processes. Typically, mathematical models of biological systems are highly complex with a large number of unknown parameters, whereas the amount of experimental data is almost always limited and insufficient to constrain the parameters. As a result of this information gap between the model complexity and the data, parameter estimation and analysis are ill-posed and very challenging problems. Two intuitive strategies for closing this gap are experimental design (obtain more data) and model reduction (simplify the model). We are working on a unified computational framework and geometric interpretation for both problems. We consider a mathematical model as a manifold living in a high-dimensional data space, and explore the projections and singularities of the manifold to perform experimental design and model reduction.
- Spatial partitioning and localization of biological functions is a phenomenon fundamental to life. At the cellular level, proteins function at specific times and locations. These subcellular locations provide the specific chemical environment and context necessary for a protein to fulfill its function. Thus, knowledge of the spatial distribution of proteins at a subcellular level is essential for understanding protein function, interactions, and cellular mechanisms. We have been developing algorithms for predicting protein subcellular localization using fluorescence microscopy images. We participated in the CYTO 2017 Image Analysis Challenge, which focused on this topic, and we achieved top prediction accuracy. Click here to view example images from the data and more information regarding the challenge.
- The majority of microarray data analysis methods in the literature focus on identifying differences between sample groups (normal vs. cancer, treated vs. control), using unsupervised clustering, supervised classification, and various forms of statistical tests. These methods are essentially asking the same question: what is the difference between group A and group B? The differences among samples within the same group have been ignored. To explore this information, we developed a new computational method, termed Sample Progression Discovery (SPD). SPD aims to identify an underlying progression among individual samples, both within and across sample groups. We view SPD as a hypothesis generation tool when applied to datasets where the progression is unclear. For example, when applied to a microarray dataset of cancer samples, SPD assumes that the cancer samples collected from individual patients represent different stages of an intrinsic progression underlying cancer development. The inferred relationship among the samples may therefore indicate a trajectory or hierarchy of cancer progression, which serves as a hypothesis to be tested. (Click here to download the SPD software).
- Classification methods are commonly divided into two categories: unsupervised and supervised. Because unsupervised methods do not use class label information, they are able to discover new classes. However, they carry the risk of producing non-interpretable results. Supervised methods, on the other hand, will always find a decision rule that interprets the different classes. However, the class label information plays such an important role in supervised methods that it confines them by fixing the number of possible classes. Consequently, supervised methods cannot discover new classes. These limitations motivated us to propose a semi-supervised classification method, which gives the class label information a less dominant role so that class discovery and classification can be performed simultaneously. (Click here to download code).
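
To make the dropout-as-signal idea from the scRNA-seq project above more concrete, here is a minimal Python sketch. It is not the co-occurrence clustering framework itself. It only illustrates one assumption: that the pattern of which genes are detected (non-zero) in a cell can by itself separate groups of cells. The toy count matrix and the helper names `binarize` and `jaccard` are hypothetical and chosen purely for illustration.

```python
from typing import List

def binarize(counts: List[List[int]]) -> List[List[int]]:
    """Turn a cells-by-genes count matrix into a 0/1 detection pattern.

    A zero count (possibly a dropout) becomes 0; any positive count becomes 1.
    """
    return [[1 if c > 0 else 0 for c in row] for row in counts]

def jaccard(a: List[int], b: List[int]) -> float:
    """Jaccard similarity between two binary detection patterns."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0

# Hypothetical toy data: 4 cells (rows) by 6 genes (columns).
counts = [
    [5, 0, 3, 0, 0, 0],  # cell 0
    [2, 0, 7, 0, 0, 1],  # cell 1
    [0, 4, 0, 6, 0, 0],  # cell 2
    [0, 1, 0, 9, 2, 0],  # cell 3
]

patterns = binarize(counts)

# Pairwise similarity computed from detection patterns only,
# ignoring the magnitude of the counts.
similarity = [[jaccard(p, q) for q in patterns] for p in patterns]

for row in similarity:
    print(" ".join(f"{value:.2f}" for value in row))
```

In this toy example, cells 0 and 1 share most of their detected genes, as do cells 2 and 3, while the two pairs share none, so the binary patterns alone suggest two groups. A real analysis would need to handle noise, sequencing depth, and many more genes and cells.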