
Completed projects

Quality assessment and refinement of chromatin accessibility data using a sequence-based predictive model.

A figure shows the peak quality assessment workflow. Starting from epigenomic data such as DNase-seq, ATAC-seq, and ChIP-seq, peaks are sorted by signal strength and grouped into subsets of 5,000 peaks. Each subset is then used as positive data to train a gkm-SVM model against random genomic regions as negative data. A score for each peak subset is calculated as the area under the ROC curve (AUC) using cross-validation. A gkmQC curve plots these scores against the rank of the peak subsets on the X-axis.

Chromatin accessibility assays are commonly used to identify regulatory elements in the genome that are associated with gene transcription. However, the quality of the data produced by these assays can vary significantly due to a range of biological and technical factors. To address this problem, we developed a machine learning method called gapped k-mer SVM quality check (gkmQC) to evaluate the quality of chromatin accessibility data using prediction accuracy as a metric. Using this method, we were able to identify "high-quality" samples that were more accurately aligned with functional regulatory elements and showed stronger associations with tissue-specific phenotypes. Additionally, gkmQC was able to optimize the peak-calling threshold for identifying additional regulatory elements, particularly in rare cell types. Overall, this study provides a useful tool for assessing the quality of chromatin accessibility data and identifying more reliable regulatory elements in the genome.
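The workflow in the figure description above can be illustrated with a minimal, standard-library sketch. It is not the gkmQC implementation: a simple k-mer log-odds scorer stands in for gkm-SVM, and all function names, the `(sequence, signal)` input format, and the default parameters other than the 5,000-peak subset size are assumptions made for illustration.

```python
import math
from collections import Counter

def rank_subsets(peaks, subset_size=5000):
    """Sort (sequence, signal) peaks by signal, strongest first, and split into subsets."""
    ordered = sorted(peaks, key=lambda p: p[1], reverse=True)
    return [ordered[i:i + subset_size] for i in range(0, len(ordered), subset_size)]

def kmer_counts(seqs, k):
    """Count every overlapping k-mer across a list of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def train_weights(pos, neg, k):
    """Stand-in for gkm-SVM training: smoothed log-odds weight per k-mer."""
    cp, cn = kmer_counts(pos, k), kmer_counts(neg, k)
    tp, tn = sum(cp.values()), sum(cn.values())
    vocab = set(cp) | set(cn)
    v = len(vocab)
    return {m: math.log((cp[m] + 1) / (tp + v)) - math.log((cn[m] + 1) / (tn + v))
            for m in vocab}

def score(seq, weights, k):
    """Sum the weights of all k-mers in a sequence (unseen k-mers count as 0)."""
    return sum(weights.get(seq[i:i + k], 0.0) for i in range(len(seq) - k + 1))

def auc(pos_scores, neg_scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def cv_auc(pos, neg, k=4, folds=5):
    """Mean held-out AUC over simple interleaved cross-validation folds."""
    aucs = []
    for f in range(folds):
        train_p = [s for i, s in enumerate(pos) if i % folds != f]
        test_p = [s for i, s in enumerate(pos) if i % folds == f]
        train_n = [s for i, s in enumerate(neg) if i % folds != f]
        test_n = [s for i, s in enumerate(neg) if i % folds == f]
        w = train_weights(train_p, train_n, k)
        aucs.append(auc([score(s, w, k) for s in test_p],
                        [score(s, w, k) for s in test_n]))
    return sum(aucs) / folds

def qc_curve(peaks, negatives, subset_size=5000, k=4, folds=5):
    """One cross-validated AUC per ranked peak subset (index 0 = strongest peaks)."""
    return [cv_auc([seq for seq, _ in subset], negatives, k, folds)
            for subset in rank_subsets(peaks, subset_size)]
```

In this sketch, a sample whose lower-ranked subsets remain predictable from sequence would yield a curve that stays high across ranks, whereas a curve that drops quickly toward 0.5 would suggest that the weaker peaks are hard to distinguish from random genomic regions.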

Ongoing projects

Constructing Gene Regulatory Networks Using Kidney Multiome Data

The top plot shows links connecting regulatory elements to their potential target genes, determined by the correlation between gene expression and regulatory element activity across cells. Correlation strength is shown as link height on the y-axis. The middle plot shows genes; in this region, the WT1 gene is displayed. The bottom track shows “chromatin accessibility” as the proportion of cells with peaks in each cluster, for ten different cell types.

The aim of this project is to construct a detailed map of the gene regulatory networks in the human kidney. A “gene regulatory network” is a map of how genes interact with each other to carry out the functions of an organism. By understanding how these networks operate, we can better understand complex cellular processes and how they may be disrupted in disease. To build this map, we will use single-cell Multiome data from kidney tissue, in which chromatin accessibility and gene expression are measured simultaneously in each cell, genome-wide. This allows us to study gene regulation at a fine-grained, cell-type-specific level. By analyzing these data, we can infer the gene regulatory networks that are active in different cell types within the kidney and potentially identify new targets for therapeutic intervention in kidney diseases.
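The correlation-based linking of regulatory elements to target genes described in the figure above can be sketched as follows. This is a minimal illustration that assumes Pearson correlation and a hypothetical cutoff `min_r`; the names `pearson` and `link_peaks_to_gene`, the input format, and the threshold value are not taken from the project itself.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def link_peaks_to_gene(gene_expr, peak_access, min_r=0.3):
    """Keep peaks whose accessibility correlates with the gene's expression.

    gene_expr:   expression of one gene, one value per cell (or cell group)
    peak_access: dict of peak name -> accessibility values in the same cell order
    min_r:       hypothetical correlation cutoff, chosen only for illustration
    """
    links = {}
    for peak, acc in peak_access.items():
        r = pearson(acc, gene_expr)
        if r >= min_r:
            links[peak] = r
    return links
```

The returned correlation values correspond to the link heights described in the figure; in practice, candidate peaks would typically also be restricted to a window around the gene, which this sketch leaves out.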

Developing machine-learning models for improved regulatory variant prediction 

A diagram shows the workflow of ensemble model building. Starting with epigenomic data, baseline models such as deltaSVM, DeepSEA, and Basenji are built. The data are also used to identify regulatory variants. These two components are used to build ensemble models, which perform classification using majority voting and averaging techniques.

Regulatory variants are changes in the DNA sequence that can alter the expression of genes, potentially leading to differences in traits or disease risk. In this project, we aim to develop more accurate computational models to predict the effects of regulatory variants using single-cell chromatin accessibility data and machine-learning techniques. Chromatin accessibility refers to the degree to which genomic DNA is open. This “open” DNA can then be accessed by the transcription machinery to modulate the expression of target genes. We will employ several strategies, including ensemble models, which are collections of multiple models that work together to make more accurate predictions than any single model could on its own.
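The two ensemble techniques named in the diagram, majority voting and averaging, can be sketched in a few lines. This is a generic illustration, not the project's code: the function names and the tie-breaking rule are assumptions, and the inputs stand for per-variant outputs of baseline models such as those listed above.

```python
def majority_vote(predictions):
    """Combine binary labels from several models, one label list per model.

    A variant is called 1 only if strictly more than half of the models
    predict 1; ties (possible with an even number of models) resolve to 0,
    which is an arbitrary choice made for this sketch.
    """
    n_models = len(predictions)
    return [1 if sum(votes) * 2 > n_models else 0 for votes in zip(*predictions)]

def average_scores(scores):
    """Combine continuous scores from several models by taking the per-variant mean."""
    return [sum(vals) / len(vals) for vals in zip(*scores)]
```

For example, with three models predicting `[1, 0, 1]`, `[1, 1, 0]`, and `[0, 0, 1]` for three variants, `majority_vote` returns `[1, 0, 1]`. Averaging assumes that the models' scores are on comparable scales, so scores from different models would usually need to be normalized first.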

Developing new strategies to combine multiple cohorts for genome-wide association studies

A flow chart shows two different ways of going from Cohort A and Cohort B to imputed genotypes.

Genome-wide association studies (GWASs) are a powerful tool for identifying genetic variants associated with specific traits or diseases. One of the challenges of GWAS is that it can be difficult to get enough data to have sufficient power to identify these genetic associations, especially for rare traits or diseases. A potential solution is to combine data from multiple cohorts (groups of individuals) in a single GWAS. By pooling data from multiple studies, researchers can increase the sample size and potentially increase the accuracy of the results. However, combining data from multiple studies can also introduce additional challenges, such as differences in the data collection methods or the characteristics of the study populations. This project aims to identify the best ways to overcome these challenges and effectively combine data from multiple studies in a single GWAS. If successful, our new method will empower researchers to identify genetic associations more accurately and better understand the genetic basis of various traits and diseases.

Experimental Identification of Regulatory Variants Using Massively Parallel Reporter Assay

A massively parallel reporter assay (MPRA) is a sequencing-based technique to study the activity of regulatory elements in a high-throughput manner. In collaboration with Dr. Ashish Kapoor’s Laboratory at The University of Texas Health Science Center, we employ MPRA to validate regulatory variants predicted by our computational models. We also use MPRA to discover genetic variants associated with human traits that have regulatory activities. Specifically, we are evaluating common variants associated with heart and kidney function.