DIVISION OF DATA SCIENCE EVENTS

Unless specified below, all seminars will start at 3:30 pm Fridays in Pickard Hall, Room 110:

Title: "When Memory Learns to Compute: A New Era of AI Acceleration"

When: Nov 21st, 2025

Muhammad Rashed
Assistant Professor in the Department of Computer Science and Engineering at the University of Texas at Arlington

Abstract: "The exponential growth in the availability of digital data has powered the emergence of data-driven applications like large language models, computer vision, and digital twins. These applications have incredibly high computing demands that exceed the capabilities of today's high-performance computing systems. Unfortunately, the limitations of scaling silicon technology and the von Neumann bottleneck suggest that these demands cannot be addressed through traditional means. To address the pressing challenge of computational scalability and efficiency, new computing paradigms are being explored. One promising solution to this computational challenge is to perform in-memory computation using emerging non-volatile memory devices. This approach enables energy-efficient execution of computationally expensive operations and promises substantial improvements in throughput. However, this technology is still in its early stages. To fully unleash the promises of in-memory computation systems, we need tailored solutions throughout the entire computing stack, from algorithm to device physics. This talk will explore the promises, challenges, and future directions of in-memory computing systems in enabling next-generation AI."

Title: "Identifiability of the minimum-trace directed acyclic graph without strict local optima under weakly increasing error variances"

When: Nov 14th, 2025

Hyunwoong (Woody) Chang
Assistant Professor of Statistics in the Department of Mathematical Sciences at the University of Texas at Dallas

Abstract: "We prove that the true underlying directed acyclic graph (DAG) in Gaussian linear structural equation models is identifiable as the minimum-trace DAG when the error variances are weakly increasing with respect to the true causal ordering. This result bridges two existing frameworks as it extends the identifiable cases within the minimum-trace DAG method and provides a principled interpretation of the algorithmic ordering search approach, revealing that its objective is actually to minimize the total residual sum of squares. On the computational side, we prove that the hill climbing algorithm with a random-to-random (R2R) neighborhood does not admit any strict local optima. Under standard settings, we confirm the result through extensive simulations, observing only a few weak local optima. Interestingly, algorithms using other neighborhoods of equal size exhibit suboptimal behavior, having strict local optima and a substantial number of weak local optima."

Title: "Spatial deconvolution and cell type-specific spatially variable gene detection in spatial transcriptomics"

When: Oct 17th, 2025

Yuehua Cui
Professor and Graduate Program Director in the Department of Statistics and Probability at Michigan State University

Abstract: "Spatial transcriptomics (ST) provides crucial insights into tissue-specific gene expression patterns in cancer research, with identifying spatially variable genes (SVGs) being a key analytical task. Most ST data, such as those from the 10x Visium platform, capture gene expression at the spot level, where measurements reflect signals from multiple cells of varying types. Deconvolution of these multi-cellular signals is essential for accurately inferring cell type compositions, which play a crucial role in downstream analyses. Many SVGs correlate with cell type compositions, leading to the concept of cell type-specific SVGs (ctSVGs). However, not all SVGs are ctSVGs, and vice versa, as some genes exhibit random spatial patterns within cell types, complicating detection. In this talk, I will first introduce some of our recent work on spatial deconvolution, then present STANCE, a unified statistical model designed for detecting both SVGs and ctSVGs under a linear mixed-effects framework that integrates gene expression, spatial location, and cell type composition information. STANCE ensures rotation-invariant results using a two-stage approach, first detecting SVGs/ctSVGs and then conducting ctSVG-specific testing. We demonstrate its robust performance through extensive simulations and public dataset analyses, showcasing its potential to enhance spatial transcriptomics research."

Title: "Healthcare and Public Health Monitoring and Management"

When: Sept 19th, 2025

Kwok Tsui
Professor in the Department of Industrial, Manufacturing, and Systems Engineering at the University of Texas at Arlington

Abstract: "Due to the advancement of computation power, sensing technologies, and data collection tools, the field of healthcare and public health monitoring and management has evolved over the past several decades under different names in different application domains, such as statistical process control (SPC), process monitoring, health surveillance, prognostics and health management (PHM), personalized medicine, etc. There are tremendous opportunities in interdisciplinary research on system monitoring through the integration of SPC, system informatics, data analytics, PHM, and personalized health management. In this talk we will present our views and experiences in the evolution of systems health monitoring and management, its challenges and opportunities, as well as its applications in both healthcare surveillance and public health management."

Title: "Optimization Using Model Predictive Control Combined with iLQR and Neural Networks"

When: Sept 12th, 2025 (Jointly hosted by the Division of Data Science and the Department of Mathematics)

Dao Nguyen
Assistant Professor in the Department of Mathematics and Statistics at San Diego State University

Abstract: "This talk is devoted to combining model predictive control (MPC) and deep learning methods, specifically neural networks, to solve high-dimensional optimization and control problems. MPC is a popular method for real-life process control in various fields, but its computational requirements can often become a bottleneck. In contrast, deep learning algorithms have shown effectiveness in approximating high-dimensional systems and solving reinforcement learning problems. By leveraging the strengths of both MPC and neural networks, we aim to improve the efficiency of solving MPC problems. The talk also discusses the optimal control problem in MPC and how it can be divided into smaller time horizons to reduce computational costs. Additionally, we focus on enhancing MPC through two approaches: a machine learning–based feedback controller and a machine learning–enhanced planner, which involve implementing neural networks and iLQR. Overall, this talk provides insights into the potential of combining MPC and deep learning methods to tackle complex control problems across various fields, with applications to robotics."

SPRING 2025

Title: "Beauty and the Beast: Data Science for Health Research"

When: April 11th, 2025

Eun-Young (E.Y.) Mun
Regents Professor and Associate Dean for Research and Innovation in the College of Public Health, University of North Texas Health Science Center

Abstract: "I will discuss how my research has evolved through interdisciplinary collaborations with experts across various fields. Innovation often emerges when we encounter new ideas and explore “what-ifs.” However, transforming these possibilities into actionable insights and meaningful outcomes requires long-term investment in collaboration and a commitment to learning from one another. In many ways, all my work has been data science projects where domain experts and statisticians come together. One such project is Project INTEGRATE (2010 – present), an R01 project that began with a straightforward idea: combining data sets from multiple studies to leverage a larger, more diverse sample and examine comparative effectiveness and mechanisms. While conceptually simple, the project presented significant challenges, pushing me to seek collaborations and broaden my perspective. As a result, our team has published 68 papers—some in statistical journals, others in clinical journals. This body of work has also led to other funded projects, including the recent R25 project that focuses on training the next generation of addiction scientists in data science. I will share specific examples of our work, highlighting challenges and solutions. I will then discuss the importance of advancing research incrementally in the short term while fostering innovation in the long term. Like in Beauty and the Beast, progress comes from the synergy of different strengths, and collaboration transforms challenges into breakthroughs."

Title: "Predicting the risk from volcanic flows and mudslides"

When: March 7th, 2025

James Berger
Arts and Sciences Distinguished Professor Emeritus of Statistics, Duke University

Abstract: "Predicting risks from dangerous flows can be done using a combination of computer modeling, statistical modeling and probability analysis. The modeling involves Gaussian Processes and is required to be able to handle up to 10^9 data points. The analysis will be illustrated using a risk assessment from the recent Soufriere Hills volcanic eruption on the island of Montserrat."

Title: "Scrambled Signals in the Noise: Navigating Heteroscedastic Measurement Errors"

When: February 21st, 2025

Nelis Potgieter
Associate Professor of Statistics, Department of Mathematics, Texas Christian University

Abstract: "Measurement error occurs when observed values deviate from the true values of interest. These deviations can result from issues with measurement instruments, user error, or environmental factors. For example, in blood pressure measurements, both the type of device used and the technician’s level of expertise can lead to inaccurate or “contaminated” readings. This seminar will address the challenges of parameter estimation in the presence of measurement error. We will focus on heteroscedasticity, where the magnitude of the errors can vary across observations. We will introduce a new estimation approach for linear errors-in-variables models, which are regression models where the predictor variable is subject to measurement error. Our method handles situations where the measurement error distribution is both unknown and heteroscedastic. This approach leverages the phase function, a lesser-known but powerful statistical tool, which will be explained in detail. By combining a moment correction technique with a phase function-based method, we offer a comprehensive solution. We will demonstrate the consistency and asymptotic normality of this new estimator and showcase its strong performance in finite samples, particularly when measurement errors deviate from the normal distribution.
The seminar will balance statistical rigor with illustrations and examples, equipping you with new tools to unscramble the noise and better detect the true signal in your data."

Title: "Bayesian Scalable Precision Factor Analysis for Gaussian Graphical Models"

When: February 14th, 2025

Noirrit Kiran Chandra
Assistant Professor of Statistics, Department of Mathematical Sciences, The University of Texas at Dallas

Abstract: "We propose a novel approach to estimating a multivariate Gaussian precision matrix that relies on decomposing it into a low-rank and a diagonal component. Such decompositions are very popular for modeling large covariance matrices as they admit a latent factor based representation that allows easy inference. The same is however not true for precision matrices due to the lack of computationally convenient representations, which restricts inference to low-to-moderate dimensional problems. We address this remarkable gap in the literature by building on a latent variable representation for such decomposition for precision matrices. The construction leads to an efficient Gibbs sampler that scales very well to high-dimensional problems far beyond the limits of the current state-of-the-art. The ability to efficiently explore the full posterior space also allows easy assessment of model uncertainty. Exact zeros in the matrix encoding the underlying conditional independence graph are then determined via a novel posterior false discovery rate control procedure. A near minimax optimal posterior concentration rate for estimating precision matrices is attained by our method under mild regularity assumptions. We evaluate the method’s empirical performance through synthetic experiments and illustrate its practical utility. We then extend the model to arbitrary non-Gaussian distributed data with autocorrelations using a matrix-Gaussian copula approach for a novel application in resting state functional connectivity analysis in the auditory subcortical region of the human brain."

Title: "Stage-Aware Learning for Dynamic Treatments"

When: January 31st, 2025

Annie Qu
Chancellor’s Professor, Department of Statistics, University of California, Irvine

Abstract: "Recent advances in dynamic treatment regimes (DTRs) provide powerful optimal treatment searching algorithms, which are tailored to individuals’ specific needs and able to maximize their expected clinical benefits. However, existing algorithms could suffer from insufficient sample size under optimal treatments, especially for chronic diseases involving long stages of decision-making. To address these challenges, we propose a novel individualized learning method which estimates the DTR with a focus on prioritizing alignment between the observed treatment trajectory and the one obtained by the optimal regime across decision stages. By relaxing the restriction that the observed trajectory must be fully aligned with the optimal treatments, our approach substantially improves the sample efficiency and stability of inverse probability weighted based methods. In particular, the proposed learning scheme builds a more general framework which includes the popular outcome weighted learning framework as a special case. Moreover, we introduce the notion of stage importance scores along with an attention mechanism to explicitly account for heterogeneity among decision stages. We establish the theoretical properties of the proposed approach, including the Fisher consistency and finite-sample performance bound. Empirically, we evaluate the proposed method in extensive simulated environments and a real case study for the COVID-19 pandemic."

FALL 2024

Title: "Generalised Long Memory Time Series and Applications: An Overview"

When: November 15th, 2024

Shelton Peiris
Visiting Professor at the Department of Statistics, UC Davis

Abstract: "Analysis of long memory time series has become very popular among theoretical and applied researchers in the last 2-3 decades due to its flexibility in many applications in almost every field. In this paper, particular attention has been paid to the development of Generalised Long Memory time series generated by Gegenbauer polynomials and Autoregressive Moving Average (ARMA) models. Several estimation methods will be discussed and recent applications in various fields will be presented. A multivariate or vector extension to the GARMA family (i.e., Vector GARMA or VEGARMA) will be introduced along with the relevant theoretical properties and applications."

Title: "Statistical Modeling of Topological Features in Medical Imaging: Enhancing Prognostic Precision and Interpretation"

When: November 8th, 2024

Chul Moon
Associate Professor in the Department of Statistics and Data Science at SMU

Abstract: "Tumor shape significantly influences growth and metastasis. We introduce a topological feature obtained by persistent homology to characterize tumor progression in pathology and radiology images, focusing on its influence on time-to-event data. These topological features, invariant to scale-preserving transformations, capture diverse tumor shape patterns. We introduce a functional spatial Cox proportional hazards model that represents these topological features in a functional space, utilizing them as functional predictors alongside their spatial locations. This model allows for interpretable analysis of the relationship between topological shape features and survival risks."

Title: "An epigenetic view of aging dynamics"

When: November 1st, 2024

Feng Gao
Assistant Professor in the Department of Environmental Health Sciences and Department of Molecular and Medical Pharmacology at UCLA

Abstract: "Aging is one of the most important risk factors for many diseases, and is a complex process that involves multiple factors from genetics to environmental and lifestyle factors. Meanwhile, aging-related epigenetic changes such as DNA methylation can participate in the regulation of the aging process. Therefore, understanding the epigenetic mechanisms including the dynamics in aging will provide new insights into developing new approaches for disease prevention. Indeed, the rapid development of DNA methylation analysis has provided rich information about epigenetic regulations. For example, DNA methylation analysis can measure more than 450K and 850K CpG sites through microarray technology. These data provide great opportunities to study aging but also pose great challenges in learning useful information from high-dimensional data. In this talk, I’ll share our recent research on aging. Specifically, I’ll talk about how we leverage novel computational models to reveal biological patterns and decipher the complex information embedded in high-dimensional epigenetics data to study aging. I’ll discuss our findings about the dynamics of the aging process. Finally, I’ll talk about our future directions in leveraging multi-omics data for aging studies."

Title: "Some Issues and Challenges in the analysis of Biomedical Data"

When: October 25th, 2024

Zhezhen Jin
Professor of Biostatistics in the Department of Biostatistics in Mailman School of Public Health at Columbia University

Abstract: "It is essential to incorporate basic statistical principles and ideas in data analysis. In the analysis of biomedical data, one often needs to compare and identify biomarkers that are more informative for disease diagnosis and monitoring, and to evaluate the effects of various treatment procedures and plans on health outcomes. After a discussion of the issues and challenges with some real examples, I will review available statistical methods and present our newly developed semiparametric statistical methods that are useful for item reduction, differentiation of significant exposure factors, and high-dimensional data analysis."

Title: "3D reconstruction of spatial transcriptomics with spatial pattern enhanced graph convolutional neural network"

When: October 4th, 2024

Lin Xu
Assistant Professor in the Department of Health Data Sciences and Biostatistics at SPH, the Department of Pediatrics at Medical School, and a member of the Quantitative Biomedical Research Center (QBRC), and the Harold C. Simmons Cancer Center at UT Southwestern Medical Center

Abstract: "Existing statistical and deep learning algorithms used for analyzing spatially resolved transcriptomics (SRT) data rely solely on two-dimensional (2D) spatial coordinates, which limits their ability to accurately identify three-dimensional (3D) spatial patterns. To address this limitation, we introduce Spa3D, which utilizes an anti-leakage Fourier transform and a graph convolutional neural network model to reconstruct 3D-based spatial structure. We demonstrate that Spa3D is applicable to data from various SRT technology platforms and outperforms state-of-the-art methods in elucidating 3D-based spatial domains, cell-cell communication, organ-level tempo-spatial development patterns, and 3D spatial trajectories that are not captured by 2D spatial coordinates."

Title: "A Meta-analysis based Hierarchical Variance Model for Powering One and Two-sample t-tests"

When: September 20th, 2024

Jackson Barth
Assistant Professor in the Department of Statistical Science at Baylor University

Abstract: "Sample size determination (SSD) is essential in statistical inference and hypothesis testing, as it directly affects the accuracy and power of the analysis. We propose a SSD methodology for one and two-sample t-tests that ensures clinical relevance using a pre-determined unstandardized effect size. Our novel approach leverages Bayesian meta-analysis to account for the uncertainty surrounding the variance, a common issue in SSD. By incorporating prior knowledge from related studies via a Bayesian gamma-inverse gamma model, we obtain an informative posterior predictive distribution for the variance that leads to better decisions about sample size. For efficient posterior sampling, we propose an empirical Bayes approach, which is further combined with a quantile simulation approach to facilitate computation. Simulations and empirical studies demonstrate that our methodology outperforms other aggregate approaches (simple average, weighted average, median) in variance estimation for SSD, especially in meta-analyses with large disparity in sample size and moderate variance. Thus, it offers a robust and practical solution for sample size determination in t-tests."

Title: "Multimodal Large Language Models for Biomedical Applications"

When: September 13th, 2024

Junzhou Huang
Jenkins Garrett Professor in the Computer Science and Engineering department at the University of Texas at Arlington

Abstract: "Biomedical research is increasingly characterized by the availability of vast and diverse data types, ranging from imaging data, genetic sequences, and molecular profiles to clinical texts and patient records. This rich array of biomedical data presents significant opportunities for advancing our understanding of complex biological systems and improving healthcare outcomes. However, the challenge lies in effectively integrating and analyzing these multimodal datasets to extract meaningful insights. Large Language Models (LLMs) have recently emerged as powerful tools capable of processing and understanding diverse data modalities, enabling more comprehensive and accurate insights into biomedical applications. This talk will introduce several recent works that leverage multimodal LLMs to address key challenges across different biomedical domains. Specifically, we will explore the development and application of multimodal LLMs for computational pathology, gene ontology, and computational immunology. These approaches aim to bridge the gap between different data types, enabling more comprehensive and insightful interpretations that can drive new discoveries in biomedical science."

SPRING 2024

Title: "Statistics and the Knowledge Economy"

When: April 19th, 2024
Where: Frances Anne Moody Hall, Southern Methodist University

David Banks
Professor of the Practice of Statistics, Duke University

Abstract: "Statistics came of age when manufacturing was king. But today’s industries are focused on information technology. Remarkably, a lot of our expertise transfers directly. This talk will discuss statistics and AI in the context of computational advertising, autonomous vehicles, large language models, and process optimization."

Title: "Frequency Band Analysis of Nonstationary Multivariate Time Series"

When: April 12th, 2024

Raanju Sundararajan
Assistant Professor in the Department of Statistics and Data Science at Southern Methodist University

Abstract: "Information from frequency bands in biomedical time series provides useful summaries of the observed signal. Many existing methods consider summaries of the time series obtained over a few well-known, pre-defined frequency bands of interest. However, these methods do not provide data-driven methods for identifying frequency bands that optimally summarize frequency-domain information in the time series. A new method to identify partition points in the frequency space of a multivariate locally stationary time series is proposed. These partition points signify changes across frequencies in the time-varying behavior of the signal and provide frequency band summary measures that best preserve the nonstationary dynamics of the observed series. An L_2 norm-based discrepancy measure that finds differences in the time-varying spectral density matrix is constructed, and its asymptotic properties are derived. New nonparametric bootstrap tests are also provided to identify significant frequency partition points and to identify components and cross-components of the spectral matrix exhibiting changes over frequencies. Finite-sample performance of the proposed method is illustrated via simulations. The proposed method is used to develop optimal frequency band summary measures for characterizing time-varying behavior in resting-state electroencephalography (EEG) time series, as well as identifying components and cross-components associated with each frequency partition point."

Title: "Interplay of Linear Algebra, Machine Learning, and High Performance Computing"

When: April 5th, 2024

Xiaoye Sherry Li
Senior Scientist in the Computational Research Division, Lawrence Berkeley National Laboratory

Abstract: "In recent years, we have seen a large body of research using hierarchical matrix algebra to construct low complexity linear solvers and preconditioners. Not only can these fast solvers significantly accelerate the speed of large scale PDE based simulations, but also they can speed up many AI and machine learning algorithms which are often matrix-computation-bound. On the other hand, statistical and machine learning methods can be used to help select best solvers or solvers' configurations for specific problems and computer platforms. In both of these fields, high performance computing becomes an indispensable cross-cutting tool for achieving real-time solutions for big data problems. In this talk, we will show our recent developments in the intersection of these areas. "

Title: "Robust Mendelian Randomization coupled with Alphafold2 for drug target discovery"

When: March 29th, 2024

Zhonghua Liu
Assistant Professor in the Department of Biostatistics at Columbia University

Abstract: "Mendelian randomization (MR) uses genetic variants as instrumental variables (IVs) to infer the causal effect of a modifiable exposure on the outcome of interest by removing unmeasured confounding bias. However, some genetic variants might be invalid IVs due to violations of core IV assumptions. MR analysis with invalid IVs might lead to biased causal effect estimates and misleading scientific conclusions. To address this challenge, we propose a novel MR method that first Selects valid genetic IVs and then performs Post-selection Inference (MR-SPI) based on two-sample genome-wide summary statistics. We analyze 912 plasma proteins using the large-scale UK Biobank proteomics data in 54,306 participants and identify 7 proteins (TREM2, PILRB, PILRA, EPHA1, CD33, RET, CD55) significantly associated with the risk of Alzheimer’s disease. We employ AlphaFold2 to predict the 3D structural alterations of these 7 proteins due to missense genetic variations, providing new insights into their biological functions in disease etiology."

Title: "Microbes And Climate Change: Insights From A Grassland Experiment"

When: March 1st, 2024, 12 pm
Where: EES 100

Jizhong Zhou
George Lynn Cross Research Professor in the School of Biological Sciences, Director of the Institute for Environmental Genomics, The University of Oklahoma

Abstract: "The acceleration of global climate warming, a consequence of the buildup of atmospheric CO2 and other greenhouse gases due to fossil fuel combustion and land use change, represents one of the greatest scientific and policy concerns in the 21st century. Understanding the mechanisms of biospheric feedbacks to climate change is critical to project future climate warming. Although microorganisms catalyze most biosphere processes related to fluxes of greenhouse gases, the roles of microorganisms in regulating future climate change remain elusive. With time-series data from a long-term climate change experiment in Oklahoma, our results showed that microorganisms play central roles in regulating soil carbon dynamics through three primary feedback mechanisms: climate warming stimulates microbial temporal turnovers and divergent succession, enhances network complexity and stability, but reduces microbial diversity. Our results also demonstrated that incorporating microbial community information significantly improves the predictability of global change models. All these results have important implications for modeling and predicting future climate change, as well as for policy-making."

Title: "Genetic prediction of disease risk across populations"

When: February 23rd, 2024

Hongyu Zhao
The Ira V. Hiscock Professor of Biostatistics at School of Public Health, Yale University

Abstract: "Polygenic risk score (PRS) has demonstrated its great utility in biomedical research through identifying high risk individuals for different diseases from their genotypes. However, the broader application of PRS to the general population is hindered by the limited transferability of PRS developed in Europeans to non-European populations. To improve PRS prediction accuracy in non-European populations, we develop Bayesian methods that can effectively integrate genome-wide association study summary statistics from different populations. Our methods automatically adjust for linkage disequilibrium differences between populations, and characterize the joint distribution of the effect sizes of a variant in different populations to be null, population specific, or shared with correlation. Through simulations and applications to real traits, we show that our methods improve the prediction performance over existing methods in non-European populations."

Title: "Bridging the Gap: From AI Research to Clinical Application in Medicine"

When: February 2nd, 2024

Steve B. Jiang
Barbara Crittenden Professor in Medical Artificial Intelligence and Automation Lab and Department of Radiation Oncology, University of Texas Southwestern Medical Center

Abstract: "This talk provides an extensive overview of the challenges and potential solutions in implementing AI in clinical medicine. It traces the transition from research to real-world application, focusing on critical aspects such as model generalizability, commissioning, and performance deterioration over time, etc. The presentation also addresses the integration of AI into existing medical workflows and the adaptation to different physician styles, highlighting the importance of a system-centric approach. Emphasizing the need for real-world adaptability and clinician engagement, this talk aims to shed light on developing AI tools that are not only technologically advanced but also seamlessly integrated and functional in diverse clinical settings."

Title: "BART: The Remarkable Flexibility of a Bayesian Ensemble of Trees"

Abstract: "Motivated by ensemble methods in general, and gradient boosting in particular, BART (Bayesian Additive Regression Trees) is a Bayesian nonparametric regression approach for the discovery of the underlying relationship between Y and a multidimensional vector x. Approximating the conditional mean E[Y|x] with a sum of regression trees, BART is built on a statistical model: a likelihood combined with a prior that regularizes the trees to be dimensionally adaptive weak learners. Fitting and inference are accomplished with rapidly mixing Bayesian backfitting MCMC algorithms that enable full posterior inference, including point and interval estimates of E[Y|x], as well as model-free variable selection. To further illustrate the modeling flexibility of a Bayesian ensemble of trees, we also consider two BART elaborations: MBART and HBART. Exploiting potential monotonicities of E[Y|x], MBART incorporates a basis of multivariate monotone trees, thereby enabling the discovery and estimation of decompositions of the directions of E[Y|x] into their unique monotone increasing and decreasing components. To detect and mitigate the possible presence of heteroscedasticity, HBART incorporates an additional product-of-trees model component for the conditional variance, thereby enabling simultaneous inference about both E[Y|x] and Var[Y|x]. (This is joint research with H. Chipman, M. Pratola, R. McCulloch and T. Shively)."

When: January 26th, 2024

Edward I. George
Universal Furniture Professor Emeritus of Statistics and Data Science, The Wharton School, University of Pennsylvania

FALL 2023

Title: "pan-MHC and cross-Species Prediction of T Cell Receptor-Antigen Binding with pMTnet-omni"

When: December 1st, 2023

Tao Wang
Associate Professor, UT Southwestern Medical Center

Abstract: "Profiling the binding of T cell receptor (TCR) of T cells towards antigenic peptides presented by MHC proteins is one of the most important unsolved problems in modern immunology. Traditional experimental methods to probe TCR-antigen interactions are slow, labor-intensive, costly, and low- to middle-throughput. To address this problem, we developed pMTnet-omni, an Artificial Intelligence (AI) system based on hybrid protein sequence and structure information, to predict the pairing of TCRs of αβ T cells with peptide-MHC complexes (pMHCs). pMTnet-omni is capable of handling peptides presented by both class I and II pMHCs, and capable of handling both human and mouse TCR-pMHC pairs. pMTnet-omni achieves a high overall Area Under the Curve of Receiver Operator Characteristics (AUROC) of 0.89, which surpasses competing tools by a large margin. We showed that pMTnet-omni can distinguish binding affinity of TCRs with similar sequences. Across a range of datasets from various biological contexts, pMTnet-omni characterized the longitudinal evolution and spatial heterogeneity of TCR-pMHC interactions and their functional impact. We successfully built a biomarker based on pMTnet-omni for predicting immune-related adverse events of immune checkpoint inhibitor (ICI) treatment in a cohort of 57 ICI-treated patients. pMTnet-omni represents a large step closer to a clinically usable AI system for TCR-pMHC pairing prediction that can aid the design and implementation of TCR-based immunotherapeutics."

Title: "Recasting Computer Science Problems in Data Science"

When: November 10th, 2023

Suku Nair
Vice Provost for Research & Chief Innovation Officer; Director, AT&T Center for Virtualization, Southern Methodist University

Abstract: "One of my favorite quotes in the cyber world is “In this day and age, either you are going to touch a computer or you are going to be touched by it”. A corollary in the data world is “Either you are going to produce data or you are going to consume it”. Therein lie the intertwined worlds of computer science and data science. However, in spite of obvious overlaps, fundamental objectives and methodologies in computer science and data science differ significantly. Computer science is built on universal principles to study and solve problems in computation, algorithms, and abstraction. The common approach is to create general-purpose solutions through algorithmic design(s). In contrast, data science is more empirical in nature, founded on data-driven exploration, statistical analysis, and machine learning for understanding and solving problems. With the availability of large amounts of data in most domains and tremendous advancements in computing capabilities, several problems in computer science have been recast as data science problems for more efficient solutions. In this talk we will present data-driven solution transitions for some of the problems in our research domains including Cyber Security, System Reliability, Computer and telecom networks, and Human Machine Interfaces."

Title: "Doubly Flexible Estimation under Label Shift"

When: November 17th, 2023

Yanyuan Ma
Professor, Pennsylvania State University

Abstract: "In studies ranging from clinical medicine to policy research, complete data are usually available from a population P, but the quantity of interest is often sought for a related but different population Q which only has partial data. In this paper, we consider the setting that both outcome Y and covariate X are available from P whereas only X is available from Q, under the so-called label shift assumption, i.e., the conditional distribution of X given Y remains the same across the two populations. To estimate the parameter of interest in population Q via leveraging the information from population P, the following three ingredients are essential: (a) the common conditional distribution of X given Y, (b) the regression model of Y given X in population P, and (c) the density ratio of the outcome Y between the two populations. We propose an estimation procedure that only needs some standard nonparametric regression technique to approximate the conditional expectations with respect to (a), while by no means needs an estimate or model for (b) or (c); i.e., doubly flexible to the possible model misspecifications of both (b) and (c). This is conceptually different from the well-known doubly robust estimation in that double robustness allows at most one model to be misspecified whereas our proposal here can allow both (b) and (c) to be misspecified. This is of particular interest in our setting because estimating (c) is difficult, if not impossible, by virtue of the absence of the Y-data in population Q. Furthermore, even though the estimation of (b) is sometimes off-the-shelf, it can face the curse of dimensionality or computational challenges. We develop the large sample theory for the proposed estimator and examine its finite-sample performance through simulation studies as well as an application to the MIMIC-III database."

Title: "Omics and data science in public health"

When: October 20th, 2023, 12pm-1pm

Where: EES 100

Andrea Baccarelli MD, PhD
Leon Hess Professor and Chair, Department of Environmental Health Sciences, Columbia University

Abstract: "Dr. Baccarelli will present methods and results from recent and ongoing studies using molecular biology approaches coupled to data science to identify individuals that are more impacted or susceptible to harmful environmental exposures. He will introduce a cadre of methods ranging from epigenome-wide DNA methylation to exosome/extracellular vesicles as promising new paths to enhance understanding of the effects of environmental exposures on human health. Over the past few years, the application of contemporary machine learning methods to epigenomics, specifically to DNA methylation data, has shown that DNA methylation can provide accurate fingerprints of environmental factors, including tobacco smoking, environmental chemicals, and lifestyle. Those fingerprints reflect current exposure, but they also correlate well with past and cumulative exposure. Many investigators have compared the epigenome to a recording device built in our cells that captures both external and internal conditions. Using this framework provides untapped opportunities to identify the impact of risk factors at the individual level, as well as new approaches for risk stratification and personalized prevention. In this presentation, Dr. Baccarelli will review current evidence from recent studies and potential contributions to human health and disease. He will discuss data sources, methodological challenges for large human studies, limitations, and possible future directions."

Title: "Cancer prognosis analysis via integrating molecular and histopathological imaging features"

Abstract: "Modeling cancer prognosis is a “classic” yet still challenging problem. In the past two decades, high-throughput molecular data have been extensively used in such analysis. Very recently, it has been shown that histopathological imaging features, which are generated in the biopsy process, are also informative for modeling prognosis (and other outcomes/phenotypes). Molecular and imaging data contain overlapping as well as independent information. In our recent studies, we have developed regularization techniques, testing the degree of independent information for prognosis and integrating the two distinct types of data for prognosis modeling under homogeneity as well as heterogeneity."

When: September 22nd, 2023

Shuangge Ma, Ph.D.
Professor and Chair, Biostatistics Department, Yale University School of Public Health
