DIVISION OF DATA SCIENCE EVENTS

If no location or time is stated below, Seminars will start at 3 pm on the following Fridays in Pickard Hall, Room 110:


Title: "Statistics and the Knowledge Economy"

Abstract:
"Statistics came of age when manufacturing was king. But today’s industries are focused on information technology. Remarkably, a lot of our expertise transfers directly. This talk will discuss statistics and AI in the context of computational advertising, autonomous vehicles, large language models, and process optimization."

When: April 19th, 2024
Where: Frances Anne Moody Hall, Southern Methodist University

David Banks
Professor of the Practice of Statistics, Duke University

 


Title: "Frequency Band Analysis of Nonstationary Multivariate Time Series"

Abstract:
"Information from frequency bands in biomedical time series provides useful summaries of the observed signal. Many existing methods consider summaries of the time series obtained over a few well-known, pre-defined frequency bands of interest. However, these methods do not provide data-driven methods for identifying frequency bands that optimally summarize frequency-domain information in the time series. A new method to identify partition points in the frequency space of a multivariate locally stationary time series is proposed. These partition points signify changes across frequencies in the time-varying behavior of the signal and provide frequency band summary measures that best preserve the nonstationary dynamics of the observed series. An L_2 norm-based discrepancy measure that finds differences in the time-varying spectral density matrix is constructed, and its asymptotic properties are derived. New nonparametric bootstrap tests are also provided to identify significant frequency partition points and to identify components and cross-components of the spectral matrix exhibiting changes over frequencies. Finite-sample performance of the proposed method is illustrated via simulations. The proposed method is used to develop optimal frequency band summary measures for characterizing time-varying behavior in resting-state electroencephalography (EEG) time series, as well as identifying components and cross-components associated with each frequency partition point."

When: April 12th, 2024

Raanju Sundararajan 
Assistant Professor in the Department of Statistics and Data Science at Southern Methodist University

 


Title: "Interplay of Linear Algebra, Machine Learning, and High Performance Computing"

Abstract:
"In recent years, we have seen a large body of research using hierarchical matrix algebra to construct low complexity linear solvers and preconditioners. Not only can these fast solvers significantly accelerate large scale PDE based simulations, but they can also speed up many AI and machine learning algorithms, which are often matrix-computation-bound. On the other hand, statistical and machine learning methods can be used to help select the best solvers or solver configurations for specific problems and computer platforms. In both of these fields, high performance computing becomes an indispensable cross-cutting tool for achieving real-time solutions for big data problems. In this talk, we will show our recent developments at the intersection of these areas."

When: April 5th, 2024

Xiaoye Sherry Li
Senior Scientist in the Computational Research Division, Lawrence Berkeley National Laboratory

 


Title: "Robust Mendelian Randomization coupled with AlphaFold2 for drug target discovery"

Abstract:
"Mendelian randomization (MR) uses genetic variants as instrumental variables (IVs) to infer the causal effect of a modifiable exposure on the outcome of interest by removing unmeasured confounding bias. However, some genetic variants might be invalid IVs due to violations of core IV assumptions. MR analysis with invalid IVs might lead to biased causal effect estimates and misleading scientific conclusions. To address this challenge, we propose a novel MR method that first Selects valid genetic IVs and then performs Post-selection Inference (MR-SPI) based on two-sample genome-wide summary statistics. We analyze 912 plasma proteins using the large-scale UK Biobank proteomics data in 54,306 participants and identify 7 proteins (TREM2, PILRB, PILRA, EPHA1, CD33, RET, CD55) significantly associated with the risk of Alzheimer’s disease. We employ AlphaFold2 to predict the 3D structural alterations of these 7 proteins due to missense genetic variations, providing new insights into their biological functions in disease etiology."

When: March 29th, 2024

Zhonghua Liu
Assistant Professor in the Department of Biostatistics at Columbia University

 


Title: "Microbes And Climate Change: Insights From A Grassland Experiment"

Abstract:
"The acceleration of global climate warming, a consequence of the buildup of atmospheric CO2 and other greenhouse gases due to fossil fuel combustion and land use change, represents one of the greatest scientific and policy concerns in the 21st century. Understanding the mechanisms of biospheric feedbacks to climate change is critical to projecting future climate warming. Although microorganisms catalyze most biosphere processes related to fluxes of greenhouse gases, the roles of microorganisms in regulating future climate change remain elusive. With time-series data from a long-term climate change experiment in Oklahoma, our results showed that microorganisms play central roles in regulating soil carbon dynamics through three primary feedback mechanisms: climate warming stimulates microbial temporal turnovers and divergent succession, enhances network complexity and stability, but reduces microbial diversity. Our results also demonstrated that incorporating microbial community information significantly improves the predictability of global change models. All these results have important implications for modeling and predicting future climate change, as well as for policy-making."

When: 12:00 pm March 1st, 2024
Where: EES 100

Jizhong Zhou
George Lynn Cross Research Professor in the School of Biological Sciences and Director of the Institute for Environmental Genomics, The University of Oklahoma

 


Title: "Genetic prediction of disease risk across populations"


Abstract: "Polygenic risk score (PRS) has demonstrated its great utility in biomedical research through identifying high risk individuals for different diseases from their genotypes. However, the broader application of PRS to the general population is hindered by the limited transferability of PRS developed in Europeans to non-European populations. To improve PRS prediction accuracy in non-European populations, we develop Bayesian methods that can effectively integrate genome-wide association study summary statistics from different populations. Our methods automatically adjust for linkage disequilibrium differences between populations, and characterize the joint distribution of the effect sizes of a variant in different populations to be null, population-specific, or shared with correlation. Through simulations and applications to real traits, we show that our methods improve the prediction performance over existing methods in non-European populations."

When: Feb 23rd, 2024

Hongyu Zhao
The Ira V. Hiscock Professor of Biostatistics at School of Public Health, Yale University

 


Title: "Bridging the Gap: From AI Research to Clinical Application in Medicine"

Abstract: "This talk provides an extensive overview of the challenges and potential solutions in implementing AI in clinical medicine. It traces the transition from research to real-world application, focusing on critical aspects such as model generalizability, commissioning, and performance deterioration over time. The presentation also addresses the integration of AI into existing medical workflows and the adaptation to different physician styles, highlighting the importance of a system-centric approach. Emphasizing the need for real-world adaptability and clinician engagement, this talk aims to shed light on developing AI tools that are not only technologically advanced but also seamlessly integrated and functional in diverse clinical settings."

When: Feb 2nd, 2024

Steve B. Jiang
Barbara Crittenden Professor in Medical Artificial Intelligence and Automation Lab and Department of Radiation Oncology, University of Texas Southwestern Medical Center

 


Title: "BART: The Remarkable Flexibility of a Bayesian Ensemble of Trees"

Abstract: "Motivated by ensemble methods in general, and gradient boosting in particular, BART (Bayesian Additive Regression Trees) is a Bayesian nonparametric regression approach for the discovery of the underlying relationship between Y and a multidimensional vector x. Approximating the conditional mean E[Y|x] with a sum of regression trees, BART is built on a statistical model: a likelihood combined with a prior that regularizes the trees to be dimensionally adaptive weak learners. Fitting and inference are accomplished with rapidly mixing Bayesian backfitting MCMC algorithms that enable full posterior inference, including point and interval estimates of E[Y|x], as well as model-free variable selection. To further illustrate the modeling flexibility of a Bayesian ensemble of trees, we also consider two BART elaborations: MBART and HBART. Exploiting potential monotonicities of E[Y|x], MBART incorporates a basis of multivariate monotone trees, thereby enabling the discovery and estimation of decompositions of the directions of E[Y|x] into their unique monotone increasing and decreasing components. To detect and mitigate the possible presence of heteroscedasticity, HBART incorporates an additional product-of-trees model component for the conditional variance, thereby enabling simultaneous inference about both E[Y|x] and Var[Y|x]. (This is joint research with H. Chipman, M. Pratola, R. McCulloch and T. Shively)."

When: Jan 26th, 2024

Edward I. George
Universal Furniture Professor Emeritus of Statistics and Data Science, The Wharton School, University of Pennsylvania

 


Title: "pan-MHC and cross-Species Prediction of T Cell Receptor-Antigen Binding with pMTnet-omni"

Abstract: "Profiling the binding of T cell receptors (TCRs) of T cells towards antigenic peptides presented by MHC proteins is one of the most important unsolved problems in modern immunology. Traditional experimental methods to probe TCR-antigen interactions are slow, labor-intensive, costly, and low- to middle-throughput. To address this problem, we developed pMTnet-omni, an Artificial Intelligence (AI) system based on hybrid protein sequence and structure information, to predict the pairing of TCRs of αβ T cells with peptide-MHC complexes (pMHCs). pMTnet-omni is capable of handling peptides presented by both class I and II pMHCs, and capable of handling both human and mouse TCR-pMHC pairs. pMTnet-omni achieves a high overall Area Under the Receiver Operating Characteristic Curve (AUROC) of 0.89, which surpasses competing tools by a large margin. We showed that pMTnet-omni can distinguish binding affinity of TCRs with similar sequences. Across a range of datasets from various biological contexts, pMTnet-omni characterized the longitudinal evolution and spatial heterogeneity of TCR-pMHC interactions and their functional impact. We successfully built a biomarker based on pMTnet-omni for predicting immune-related adverse events of immune checkpoint inhibitor (ICI) treatment in a cohort of 57 ICI-treated patients. pMTnet-omni represents a large step closer to a clinically usable AI system for TCR-pMHC pairing prediction that can aid the design and implementation of TCR-based immunotherapeutics."

When: Dec 1st, 2023

Tao Wang
Associate Professor, UT Southwestern Medical Center

 


Title: "Recasting Computer Science Problems in Data Science"


Abstract: "One of my favorite quotes in the cyber world is “In this day and age, either you are going to touch a computer or you are going to be touched by it”. A corollary in the data world is “Either you are going to produce data or you are going to consume it”. Therein lie the intertwined worlds of computer science and data science. However, in spite of obvious overlaps, fundamental objectives and methodologies in computer science and data science differ significantly. Computer science is built on universal principles to study and solve problems in computation, algorithms, and abstraction. The common approach is to create general-purpose solutions through algorithmic design(s). In contrast, data science is more empirical in nature, founded on data-driven exploration, statistical analysis, and machine learning for understanding and solving problems. With the availability of large amounts of data in most domains and tremendous advancements in computing capabilities, several problems in computer science have been recast as data science problems for more efficient solutions. In this talk we will present data-driven solution transitions for some of the problems in our research domains including Cyber Security, System Reliability, Computer and Telecom Networks, and Human Machine Interfaces."

When: November 10th, 2023

Suku Nair
Vice Provost for Research & Chief Innovation Officer; Director, AT&T Center for Virtualization, Southern Methodist University


Title: "Doubly Flexible Estimation under Label Shift"

Abstract: "In studies ranging from clinical medicine to policy research, complete data are usually available from a population P, but the quantity of interest is often sought for a related but different population Q which only has partial data. In this paper, we consider the setting that both outcome Y and covariate X are available from P whereas only X is available from Q, under the so-called label shift assumption, i.e., the conditional distribution of X given Y remains the same across the two populations. To estimate the parameter of interest in population Q via leveraging the information from population P, the following three ingredients are essential: (a) the common conditional distribution of X given Y, (b) the regression model of Y given X in population P, and (c) the density ratio of the outcome Y between the two populations. We propose an estimation procedure that only needs some standard nonparametric regression technique to approximate the conditional expectations with respect to (a), while by no means needing an estimate or model for (b) or (c); i.e., it is doubly flexible to the possible model misspecifications of both (b) and (c). This is conceptually different from the well-known doubly robust estimation in that double robustness allows at most one model to be mis-specified, whereas our proposal here can allow both (b) and (c) to be mis-specified. This is of particular interest in our setting because estimating (c) is difficult, if not impossible, by virtue of the absence of the Y-data in population Q. Furthermore, even though the estimation of (b) is sometimes off-the-shelf, it can face the curse of dimensionality or computational challenges. We develop the large sample theory for the proposed estimator and examine its finite-sample performance through simulation studies as well as an application to the MIMIC-III database."

When: November 17th, 2023

Yanyuan Ma
Professor, The Pennsylvania State University



Title: "Omics and data science in public health"

Abstract: "Dr. Baccarelli will present methods and results from recent and ongoing studies using molecular biology approaches coupled to data science to identify individuals that are more impacted or susceptible to harmful environmental exposures. He will introduce a cadre of methods ranging from epigenome-wide DNA methylation to exosome/extracellular vesicles as promising new paths to enhance understanding of the effects of environmental exposures on human health. Over the past few years, the application of contemporary machine learning methods to epigenomics, specifically to DNA methylation data, has shown that DNA methylation can provide accurate fingerprints of environmental factors, including tobacco smoking, environmental chemicals, and lifestyle. Those fingerprints reflect current exposure, but they also correlate well with past and cumulative exposure. Many investigators have compared the epigenome to a recording device built in our cells that captures both external and internal conditions. Using this framework provides untapped opportunities to identify the impact of risk factors at the individual level, as well as new approaches for risk stratification and personalized prevention. In this presentation, Dr. Baccarelli will review current evidence from recent studies and potential contributions to human health and disease. He will discuss data sources, methodological challenges for large human studies, limitations, and possible future directions."

When: October 20th, 2023 12pm - 1pm

Where: EES 100

Andrea Baccarelli MD, PhD
Leon Hess Professor and Chair, Department of Environmental Health Sciences, Columbia University



Title: "Cancer prognosis analysis via integrating molecular and histopathological imaging features"

Abstract: "Modeling cancer prognosis is a “classic” yet still challenging problem. In the past two decades, high-throughput molecular data have been extensively used in such analysis. Very recently, it has been shown that histopathological imaging features, which are generated in the biopsy process, are also informative for modeling prognosis (and other outcomes/phenotypes). Molecular and imaging data contain overlapping as well as independent information. In our recent studies, we have developed regularization techniques, testing the degree of independent information for prognosis and integrating the two distinct types of data for prognosis modeling under homogeneity as well as heterogeneity."

When: September 22nd, 2023

Shuangge Ma, Ph.D.
Professor and Chair, Biostatistics Department, Yale University School of Public Health