STAT & Data Science Seminars

If no location or time is stated below, seminars will start at 3 pm on the following Fridays in Pickard Hall, Room 110:


Title: "Cancer prognosis analysis via integrating molecular and histopathological imaging features"

Abstract: "Modeling cancer prognosis is a “classic” yet still challenging problem. In the past two decades, high-throughput molecular data have been extensively used in such analysis. Very recently, it has been shown that histopathological imaging features, which are generated in the biopsy process, are also informative for modeling prognosis (and other outcomes/phenotypes). Molecular and imaging data contain overlapping as well as independent information. In our recent studies, we have developed regularization techniques, testing the degree of independent information for prognosis and integrating the two distinct types of data for prognosis modeling under homogeneity as well as heterogeneity."

When: September

Shuangge Ma, Ph.D.
Professor and Chair, Biostatistics Department, Yale University School of Public Health


Title: "Omics and data science in public health"

Abstract: "Dr. Baccarelli will present methods and results from recent and ongoing studies using molecular biology approaches coupled to data science to identify individuals that are more impacted or susceptible to harmful environmental exposures. He will introduce a cadre of methods ranging from epigenome-wide DNA methylation to exosome/extracellular vesicles as promising new paths to enhance understanding of the effects of environmental exposures on human health. Over the past few years, the application of contemporary machine learning methods to epigenomics, specifically to DNA methylation data, has shown that DNA methylation can provide accurate fingerprints of environmental factors, including tobacco smoking, environmental chemicals, and lifestyle. Those fingerprints reflect current exposure, but they also correlate well with past and cumulative exposure. Many investigators have compared the epigenome to a recording device built in our cells that captures both external and internal conditions. Using this framework provides untapped opportunities to identify the impact of risk factors at the individual level, as well as new approaches for risk stratification and personalized prevention. In this presentation, Dr. Baccarelli will review current evidence from recent studies and potential contributions to human health and disease. He will discuss data sources, methodological challenges for large human studies, limitations, and possible future directions."

When: October

Andrea Baccarelli, MD, PhD
Leon Hess Professor and Chair, Department of Environmental Health Sciences, Columbia University


Title: "TBD"

Abstract: "TBD"

When: November

Tao Wang
UT Southwestern Medical Center


Title: "Doubly Flexible Estimation under Label Shift"

Abstract: "In studies ranging from clinical medicine to policy research, complete data are usually available from a population P, but the quantity of interest is often sought for a related but different population Q which only has partial data. In this paper, we consider the setting that both outcome Y and covariate X are available from P whereas only X is available from Q, under the so-called label shift assumption, i.e., the conditional distribution of X given Y remains the same across the two populations. To estimate the parameter of interest in population Q by leveraging the information from population P, the following three ingredients are essential: (a) the common conditional distribution of X given Y, (b) the regression model of Y given X in population P, and (c) the density ratio of the outcome Y between the two populations. We propose an estimation procedure that only needs some standard nonparametric regression technique to approximate the conditional expectations with respect to (a), while by no means needing an estimate or model for (b) or (c); i.e., it is doubly flexible to the possible model misspecifications of both (b) and (c). This is conceptually different from the well-known doubly robust estimation in that double robustness allows at most one model to be misspecified whereas our proposal here can allow both (b) and (c) to be misspecified. This is of particular interest in our setting because estimating (c) is difficult, if not impossible, by virtue of the absence of the Y-data in population Q. Furthermore, even though the estimation of (b) is sometimes off-the-shelf, it can face the curse of dimensionality or computational challenges. We develop the large sample theory for the proposed estimator and examine its finite-sample performance through simulation studies as well as an application to the MIMIC-III database."

When: November

Yanyuan Ma
The Pennsylvania State University