  • Description

    Workshop Dates: January 11-12th, 2021 | Workshop Agenda

    The Rising Stars in Data Science workshop is a new initiative from the Center for Data and Computing (CDAC) at the University of Chicago, focused on celebrating and fast-tracking the careers of exceptional data scientists at a critical inflection point: the transition from PhD to postdoctoral scholar, research scientist, or tenure-track position. The workshop also aims to increase representation and diversity in data science by providing a platform and a supportive mentoring network for navigating academic careers in data science. Women and underrepresented minorities in computing are especially encouraged to apply.

    The two-day, remote research workshop will feature career and research panels, networking and mentoring opportunities, and 30-minute student research talks. Students will gain insights from faculty panels on career development questions such as: how to start your academic career in data science; how to strategically sustain your career through research collaborations, publications, and skill development; and how to form meaningful interdisciplinary collaborations in data science with industry and government partners. Participants will also hear inspiring keynote talks from established, cutting-edge leaders in data science.

    Eligibility & Guidelines

    • Applicants must be full-time graduate students within ~1-2 years of obtaining a PhD.
    • Applicants should be pursuing doctoral degrees in computer science, statistics, data science, or a related computational field. 
    • Applicants both from and outside of the University of Chicago are encouraged to apply.
    • Applicants may only submit one application.
    • Applicants may have nomination letters from a maximum of 2 faculty members.

    Workshop Format

    • Student research talks
    • Panels (career development, data science research)
    • Keynote address
    • 1:1 meetings with faculty members
    • Networking within the UChicago data science ecosystem
  • Rising Stars Profiles

    2021 Rising Stars

    Talk Title: Evaluating the Impact of Entity Resolution in Social Network Metrics

    Talk Abstract: Modern databases are filled with potential duplicate entries—caused by misspellings, change in address, or differences in abbreviations. The probabilistic disambiguation of entries is often referred to as entity resolution.

    Entity resolution of individuals (nodes) in relational datasets is often viewed as a pre-processing step in network analysis. Studies in bibliometrics have indicated that entity resolution changes network properties in citation networks, but little research has used real-world social networks that vary in size and type. We present a novel perspective on entity resolution in networks—where we propagate error from the entity resolution process into downstream network inferences. We also seek to understand how match thresholds in unsupervised entity resolution affect both global and local network properties, such as the degree distribution, centrality, transitivity, and motifs such as stars and triangles. We propose a calibration of these network metrics given measures of entity resolution quality, such as node “splitting” and “lumping” errors.
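    The effect of match thresholds described above can be illustrated with a toy sketch (all record names, match scores, and edges below are invented): merging record pairs whose match score clears a threshold changes downstream network metrics such as degree, and a loose threshold ("lumping") can even create triangles absent from the raw network.

```python
# Hypothetical sketch of threshold calibration in entity resolution.
# Records, scores, and edges are invented for illustration.
from collections import defaultdict

# Pairwise match scores between raw records (higher = more likely duplicates).
match_scores = {("ann_1", "ann_2"): 0.9, ("bob_1", "bob_2"): 0.6}

# Observed edges among raw (unresolved) records.
raw_edges = [("ann_1", "bob_1"), ("ann_2", "carol"), ("bob_2", "carol")]

def resolve(threshold):
    """Merge record pairs whose match score clears the threshold (union-find)."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            x = parent[x]
        return x
    for (a, b), score in match_scores.items():
        if score >= threshold:
            parent[find(a)] = find(b)
    return find

def degrees(threshold):
    """Degree of each resolved node in the network implied by the threshold."""
    find = resolve(threshold)
    adj = defaultdict(set)
    for a, b in raw_edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            adj[ra].add(rb)
            adj[rb].add(ra)
    return {node: len(nbrs) for node, nbrs in adj.items()}

# A strict threshold "splits" true entities into separate nodes; a loose
# threshold merges both pairs, collapsing the network into a triangle that
# does not exist in the raw data (changing degree and transitivity).
print(degrees(0.95))
print(degrees(0.5))
```

Propagating this kind of threshold sensitivity into downstream inferences is the calibration idea the abstract describes.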

    We use a respondent driven sample of people who use drugs (PWUD) in Appalachia and a longitudinal network study of Chicago-based young men who have sex with men (YMSM) to demonstrate the implications this has for social and public health policy.

    Bio: Abby Smith is a Ph.D. Candidate in Statistics at Northwestern University. Her work centers on evaluating the impact of entity resolution error in social network inferences. She is particularly interested in collaborative research and data science for social good applications, and most recently served as a Solve for Good consultant at the mHealth nonprofit Medic Mobile. Abby is passionate about building community for women in statistics and data science in Chicago, and serves as a WiDS Ambassador and R-Ladies: Chicago board member. She holds a Master’s in Statistical Practice and a B.S. in Mathematics, both from Carnegie Mellon.

    Talk Title: Modeling the Impact of Social Determinants of Health on Covid-19 Transmission and Mortality to Understand Health Inequities

    Talk Abstract: The Covid-19 pandemic has highlighted drastic health inequities, particularly in cities such as Chicago, Detroit, New Orleans, and New York City. Reducing Covid-19 morbidity and mortality will likely require an increased focus on social determinants of health, given their disproportionate impact on populations most heavily affected by Covid-19. A better understanding of how factors such as household income, housing location, health care access, and incarceration contribute to Covid-19 transmission and mortality is needed to inform policies around social distancing and testing and vaccination scale-up.

    This work builds upon an existing agent-based model of Covid-19 transmission in Chicago, CityCOVID. CityCOVID consists of a synthetic population that is statistically representative of Chicago’s population (2.7 million persons), along with their associated places (1.4 million places) and behaviors (13,000 activity schedules). During a simulated day, agents move from place to place, hour by hour, engaging in social activities and interactions with other colocated agents, resulting in an endogenous colocation or contact network. Covid-19 transmission is then simulated with an epidemiological model over this generated contact network, with model parameters tuned (fitted) so that simulation output matches observed Covid-19 death and hospitalization data from the City of Chicago. Using the CityCOVID infrastructure, we quantify the impact of social determinants of health on Covid-19 transmission dynamics by applying statistical techniques to empirical data to study the relationship between social determinants of health and Covid-19 outcomes.
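    The hourly colocation loop described above can be sketched in miniature (this is an invented toy, not CityCOVID code; the population size, schedules, and transmission probability are made up): agents follow activity schedules, colocation defines the contact network each hour, and infection spreads over that endogenous network.

```python
# Toy agent-based sketch of the place-to-place, hour-by-hour loop described
# above. All parameters and schedules are invented for illustration.
import random

random.seed(0)
N_AGENTS, N_PLACES, HOURS, P_TRANSMIT = 100, 10, 24, 0.05

# Invented schedules: each agent has a home place and one activity place.
home = {a: random.randrange(N_PLACES) for a in range(N_AGENTS)}
activity = {a: random.randrange(N_PLACES) for a in range(N_AGENTS)}
infected = {0}  # one seed case

for hour in range(HOURS):
    # Agents move place-to-place, hour-by-hour (daytime = activity place).
    location = {a: (activity[a] if 8 <= hour < 18 else home[a])
                for a in range(N_AGENTS)}
    # Colocation defines the contact network for this hour.
    by_place = {}
    for a, p in location.items():
        by_place.setdefault(p, []).append(a)
    # Transmission along colocation contacts.
    newly = set()
    for occupants in by_place.values():
        if any(a in infected for a in occupants):
            for a in occupants:
                if a not in infected and random.random() < P_TRANSMIT:
                    newly.add(a)
    infected |= newly

print(f"infected after one simulated day: {len(infected)} / {N_AGENTS}")
```

In the full model, the fitted parameters (e.g., transmission probability and behaviors) are tuned so that simulated outcomes match observed city-level data.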

    Bio: Abby Stevens is a fourth-year statistics PhD student at the University of Chicago advised by Rebecca Willett. She is interested in using data science techniques to address important social and political issues, such as climate science, public health, and algorithmic fairness. She graduated with a math degree from Grinnell College in 2014 and then worked as a data scientist at a healthcare tech company before entering graduate school. She has been involved in a number of data science for social good organizations and is a primary organizer of the Women in Data Science Chicago annual event.

    Talk Title: Covariant Neural Networks for Physics Applications

    Talk Abstract: Most traditional neural network architectures do not respect any intrinsic structure of the input data and are instead expected to “learn” it. CNNs are the first widespread example of a symmetry, in this case the translational symmetry of images, being used to devise much more efficient and transparent network architectures. More recently, CNNs were generalized to other non-commutative symmetry groups such as SO(3). However, in physics applications one is more likely to encounter input data that belong to linear representations of Lie groups, as opposed to being functions (or “images”) on a symmetric space of the group.

    To deal with such problems, I will present a general feed-forward architecture that takes vectors as inputs, works entirely in the Fourier space of the symmetry group, and is fully covariant. This approach allows one to achieve equal performance with drastically fewer learnable parameters. Moreover, the models become much more physically meaningful and more likely to be interpretable. My application of choice is particle physics, where the main symmetry is the 6-dimensional Lorentz group. I will demonstrate the success of covariant architectures compared to more conventional approaches.

    Bio: I am a PhD student at the University of Chicago working on theoretical hydrodynamics problems in relation to the quantum Hall effect. In addition, I am working on developing new group-covariant machine learning tools for physics applications, such as Lorentz-covariant neural networks for particle physics. My background is in mathematical physics, in which I hold a master’s degree from Saint Petersburg University in Russia. My interests lie at the intersection of theoretical and mathematical physics and new interdisciplinary applications of such ideas.

    Talk Title: Credible and Effective Data-Driven Decision-Making: Minimax Policy Learning under Unobserved Confounding

    Talk Abstract: We study the problem of learning causal-effect maximizing personalized decision policies from observational data while accounting for possible unobserved confounding. Since policy value and regret may not be point-identifiable, we study a method that minimizes the worst-case estimated regret over an uncertainty set for propensity weights that controls the extent of unobserved confounding. We prove generalization guarantees that ensure our policy will be safe when applied in practice and will in fact obtain the best-possible uniform control on the range of all possible population regrets that agree with the possible extent of confounding. Finally, we assess and compare our methods on synthetic and semi-synthetic data. In particular, we consider a case study on personalizing hormone replacement therapy based on the parallel WHI observational study and clinical trial. We demonstrate that hidden confounding can hinder existing policy learning approaches and lead to unwarranted harm, while our robust approach guarantees safety and focuses on well-evidenced improvement.  This work is joint with Nathan Kallus. An earlier version was circulated as “Confounding-Robust Policy Improvement”.  Time permitting, I will highlight recent follow-up work on robust policy evaluation for infinite-horizon reinforcement learning. 

    Bio: My research interests are at the intersection of statistical machine learning and operations research, with the goal of informing reliable data-driven decision-making. Specifically, I have made fundamental contributions and developed algorithmic frameworks for robust causal-effect-maximizing personalized decision rules in view of unobserved confounding, as well as methodology for credible impact evaluation for algorithmic fairness, with high potential impact in industry and policy. My work has been published in journals such as Management Science and top-tier computer science/machine learning venues (NeurIPS/ICML), and has received an INFORMS Data Mining Section Best Paper award. My work was previously supported by an NDSEG (National Defense Science and Engineering Graduate) Fellowship.

    Talk Title: AI for Population Health: Melding Data and Algorithms on Networks

    Talk Abstract: As exemplified by the COVID-19 pandemic, our health and wellbeing depend on a difficult-to-measure web of societal factors and individual behaviors. Tackling social challenges with AI requires algorithmic and data-driven paradigms which span the full process of gathering costly data, learning models to understand and predict interactions, and optimizing the use of limited resources in interventions. This talk presents methodological developments at the intersection of machine learning, optimization, and social networks which are motivated by on-the-ground collaborations on HIV prevention, tuberculosis treatment, and the COVID-19 response. These projects have produced deployed applications and policy impact. For example, I will present the development of an AI-augmented intervention for HIV prevention among homeless youth. This system was evaluated in a field test enrolling over 700 youth and found to significantly reduce key risk behaviors for HIV.

    Bio: Bryan Wilder is a final-year PhD student in Computer Science at Harvard University, where he is advised by Milind Tambe. His research focuses on the intersection of optimization, machine learning, and social networks, motivated by applications to population health. His work has received or been nominated for best paper awards at ICML and AAMAS, and was a finalist for the INFORMS Doing Good with Good OR competition. He is supported by the Siebel Scholars program and previously received an NSF Graduate Research Fellowship.

    Talk Title: Towards Data-Driven Internet Routing Security

    Talk Abstract: The Internet ecosystem is critical for the reliability of online daily life. However, key Internet protocols, such as the Border Gateway Protocol (BGP), were not designed to cope with untrustworthy parties, making them vulnerable to misconfigurations and attacks from anywhere in the network. In this talk, I will present an evidence-based data-driven approach to improve routing infrastructure security, which I use to identify and characterize BGP serial hijackers, networks that persistently hijack IP address blocks in BGP. I’ll also show how similar approaches can quantify the benefits of the RPKI security framework against prefix hijacks, and identify route leaks. This work improves our understanding about how our Internet actually works and has been used by industry and researchers for network reputation and monitoring of operational security practices.

    Bio: Cecilia Testart is a PhD candidate in EECS at MIT, working with David D. Clark. Her research is at the intersection of computer networks, data science, and policy. Her doctoral thesis focuses on securing the Internet’s core routing protocols, leveraging machine learning and data science approaches to understand the impact of protocol design on security, and considering both technical and policy challenges to improve the current state of the art. Cecilia holds engineering degrees from Universidad de Chile and Ecole Centrale Paris and a dual master’s degree in Technology and Policy and EECS from MIT. Prior to joining MIT, she helped set up the Chilean office of Inria (the French National Institute for Research in Digital Science and Technology) and worked for the research lab of .CL, the Chilean top-level domain. She has interned at Akamai, MSR, and the OECD. Cecilia’s work received a Distinguished Paper Award at the ACM Internet Measurement Conference in 2019.

    Talk Title: Machine Learning for Astrophysics & Cosmology in the Era of Large Astronomical Surveys and an Application for the Discovery and Classification of Faint Galaxies

    Talk Abstract: Observational astrophysics & cosmology are entering the era of big data. Future astronomical surveys are expected to collect hundreds of petabytes of data and detect billions of objects. Machine learning will play an important role in the analysis of these surveys, with the potential to revolutionize astronomy, as well as providing challenging problems that offer opportunities for breakthroughs in the fundamental understanding of machine learning.

    In this talk I will present the discovery of Low Surface Brightness Galaxies (LSBGs) from the Dark Energy Survey (DES) data. LSBGs are galaxies with intrinsic brightness less than that of the dark sky, and so are hard to detect and study. At the same time, they are expected to dominate the number density of galaxies in the universe, a population that thus remains relatively unexplored. I will discuss the development of automated, deep learning-based pipelines for LSBG detection (separation of LSB galaxies from LSB artifacts present in images) and morphological classification. Such techniques will be extremely valuable with the advent of very large future surveys like the planned Legacy Survey of Space and Time (LSST) at the Vera C. Rubin Observatory.

    Bio: Dimitrios Tanoglidis is a fifth-year PhD student at the department of Astronomy & Astrophysics at the University of Chicago. He holds a BSc in Physics and MSc in Theoretical Physics, both from the University of Crete, Greece. His research interests lie in cosmology, analysis of large galaxy surveys, and data science applications in astrophysics. He has led the research for the discovery and analysis of Low Surface Brightness Galaxies from the Dark Energy Survey Data using machine learning. Interdisciplinary in nature, he is also pursuing a certificate in Computational Social Science.

    Talk Title: Network Effects on Outcomes and Unequal Distribution of Resources

    Talk Abstract: We study how networks affect different groups differently and can reinforce existing inequalities. First, we provide observational evidence for differential network advantages in access to information: individuals from the low-status group receive lower marginal benefit from networking than the high-status group. Second, we provide causal evidence for differential diffusion of a new behavior in the network, driven mainly by homophily and slight initial advantages of one group. Third, we develop a theoretical network model that captures the network structure of unequal access to opportunities. We show that any departure from the uniform distribution of links to information sources among members of a group limits the diffusion of information to the group as a whole. Fourth, we develop an online lab experiment to further study the network mechanisms that widen inter-group differences and yield different returns on social capital to different groups. We recruit individuals to play an online collaborative game in which they have to find and dig gold mines and in the process can pass information to their network neighbors. By changing the network structure and the composition of groups with low and high initial advantage, we generate the processes that lead to unequal distribution of opportunities beyond what is expected from individual differences. Finally, we contribute to the literature on network structure and performance and propose the concept of bandwidth-diversity matching: individuals who match the tie strength to their contacts with their information novelty achieve truly diverse networks and better outcomes.

    Bio: I am a PhD candidate in the Social and Engineering Systems program at MIT IDSS, under the supervision of Prof. Pentland and Prof. Eckles. I am also receiving a second PhD in Statistics from the Statistics and Data Science Center at MIT. I received my Bachelor’s and Master’s degrees in Computer Science, both from the University of Michigan – Ann Arbor.
    My PhD research focuses on micro-level structural factors, such as network structure, that contribute to the unequal distribution of resources or information. As a computational social scientist, I use methods from network science, statistics, experiment design, and causal inference. I am also interested in understanding collective behavior in institutional settings and the institutional mechanisms that promote cooperative behavior in networks or, in contrast, lead to unequal outcomes for different groups.
    In a previous life, I worked at Google New York City as a software engineer from 2011 to 2015. Currently, I am also a research contractor at Facebook working on how networks affect economic outcomes.

    Talk Title: What and How Students Read: A Data-driven Insight

    Talk Abstract: Reading is an integral part of learning. The purpose of reading to learn is to comprehend meaning from informational texts. Reading comprehension tasks require self-regulated learning (SRL) behaviors – planning, monitoring, and evaluating one’s reading strategies. Students without SRL skills may struggle in reading, which in turn may inhibit them from acquiring domain-specific knowledge. Thus, understanding students’ reading behavior and SRL usage is important for intervention. Digital reading platforms can provide opportunities to learn and practice SRL strategies in classroom settings. These platforms log a rich array of student and teacher interaction data. Retrospective analysis of these logged data can derive insights, which can be used to support tailored interventions by instructors and students in complex learning activities. In this talk, I will discuss students’ science reading and SRL behaviors, and connect those behaviors with performance within a digital literacy platform, Actively Learn. The talk consists of two studies: (i) identifying patterns that differ between productive and unproductive students, and (ii) analyzing the association between teachers’ behavior and students’ SRL usage. I will finish my talk by outlining possible future directions.

    Bio: Effat Farhana is a Ph.D. Candidate in the Computer Science Department at North Carolina State University working with Dr. Collin F. Lynch in the ArgLab research group. She received her B.S. in Computer Science and Engineering from Bangladesh University of Engineering and Technology. Her research focuses on mining educational software to derive data-driven heuristics, machine learning, and designing interpretable machine learning algorithms.

    Talk Title: Quantifying The Power of Mental Shortcuts in Persuasive Communication with Causal Inference from Text

    Talk Abstract: The reliance of individuals on mental shortcuts based on factors such as gender, affiliation, and social status could distort the equitability of interpersonal discussions in various settings. Yet, the impact of such shortcuts in real-world discussions remains challenging to quantify. In this talk, I propose a novel quasi-experimental study that incorporates unstructured text in a principled manner to quantify the causal effect of status indicators in persuasive communication. I also examine how linguistic and rhetorical devices moderate this effect, and thus provide communication strategies to potentially reduce individuals’ reliance on mental shortcuts. I discuss implications for fair communication policies both within organizations and in society at large.

    Bio: Emaad Manzoor is a PhD candidate in the Heinz College of Information Systems and Public Policy at Carnegie Mellon University, and will begin as an assistant professor of Operations and Information Management at the University of Wisconsin-Madison in Fall 2021. Substantively, he designs randomized experiments and quasi-experimental studies to quantify the persuasive power of mental shortcuts in text-based communication, and how language can be used to moderate this power. Methodologically, he develops data-mining techniques for evolving networks and statistical frameworks for causal inference with text. He is funded by a 2020 McKinsey & Company PhD Fellowship, and was a finalist for the 2019 Snap Research PhD Fellowship, the 2019 Jane Street Depth First Learning Fellowship, and the 2019 INFORMS Annual Meeting Best Paper award.

    Talk Title: Machine Learning in Dynamical Systems

    Talk Abstract: Many branches of science and engineering involve estimation and control in dynamical systems; consider, for example, using data to help stabilize the flight of a drone or predict the path of a hurricane. We consider control in dynamical systems from the perspective of regret minimization. Unlike most prior work in this area, we focus on the problem of designing an online controller which competes with the best dynamic sequence of control actions selected in hindsight, instead of the best controller in some specific class of controllers. This formulation is attractive when the environment changes over time and no single controller achieves good performance over the entire time horizon. We derive the structure of the regret-optimal online controller using techniques from robust control theory and present a clean data-dependent bound on its regret. We also present numerical simulations which confirm that our regret-optimal controller significantly outperforms various classical controllers in dynamic environments.
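    The regret formulation above can be illustrated with a toy example (this is an invented sketch, not the authors' controller): dynamic regret compares an online policy, which must act before each disturbance is revealed, against the best dynamic sequence of actions chosen in hindsight.

```python
# Toy illustration of regret against the best dynamic action sequence in
# hindsight. The cost model and the naive online policy are invented.
import random

random.seed(1)
T = 200
w = [random.gauss(0, 1) for _ in range(T)]  # disturbances, revealed after acting

# Online policy: reuse the previous disturbance as the action (a naive
# causal controller that cannot see w[t] before choosing u at time t).
online_cost = 0.0
u_prev = 0.0
for t in range(T):
    u = u_prev                      # chosen before w[t] is revealed
    online_cost += (u - w[t]) ** 2  # per-step quadratic tracking cost
    u_prev = w[t]

# The best dynamic sequence in hindsight sets u_t = w_t exactly, paying
# zero cost, so here the dynamic regret equals the online policy's cost.
hindsight_cost = 0.0
regret = online_cost - hindsight_cost
print(f"dynamic regret over {T} steps: {regret:.1f}")
```

When disturbances drift over time, no single fixed controller does well everywhere, which is why competing with the best dynamic sequence (rather than the best fixed controller) is the stronger benchmark the abstract describes.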

    Bio: Gautam is a PhD student in the Computing and Mathematical Sciences (CMS) department at Caltech, where he is advised by Babak Hassibi. He is broadly interested in machine learning, optimization, and control, especially 1) online learning and online decision-making and 2) integrating machine learning with physics, dynamics and control. Much of his PhD work has been supported by a National Science Foundation Graduate Research Fellowship and an Amazon AWS AI Fellowship. Prior to joining Caltech, he obtained a BS in Mathematics from Georgia Tech.

    Talk Title: Adversarial Collusion on the Web: State-of-the-art and Future Directions

    Talk Abstract: The growth and popularity of online media have made it the most important platform for collaboration and communication among its users. Given this tremendous growth, the social reputation of an entity in online media plays an important role. Because the natural way to build social reputation is time-consuming, some users resort to artificial means of gaining it, such as blackmarket services. We refer to such artificial boosting of social reputation as collusion. In this talk, we will comprehensively review recent developments in analyzing and detecting collusive entities on online media. First, we give an overview of the problem and motivate the need to detect these entities. Second, we survey the state-of-the-art models, ranging from feature-based methods to more complex models such as deep learning architectures and advanced graph concepts. Third, we detail annotation guidelines, describe tools and applications, and explain the publicly available datasets. The talk concludes with a discussion of future trends.

    Bio: Hridoy Sankar Dutta is currently pursuing his Ph.D. in Computer Science and Engineering from IIIT-Delhi, India. Starting January 2021, he will be joining University of Cambridge as a Research Assistant in the Cambridge Cybercrime Centre. His current research interests include data-driven cybersecurity, social network analysis, natural language processing, and applied machine learning. He received his B.Tech degree in Computer Science and Engineering from Institute of Science and Technology, Gauhati University, India in 2013. From 2014 to 2015, he worked as an Assistant Project Engineer at the Indian Institute of Technology, Guwahati (IIT-G), India, for the project ‘Development of Text to Speech System in Assamese and Manipuri Languages’. He completed his M.Tech in Computer Science and Engineering from NIT Durgapur, India in 2015. More details can be found at https://hridaydutta123.github.io/.

    Talk Title: Computer-Aided Diagnosis of Thoracic CT Scans Through Multiple Instance Transfer Learning

    Talk Abstract: Computer-aided diagnosis systems have demonstrated significant potential in improving patient care and clinical outcomes by providing more extensive information to clinicians.  The development of these systems typically requires a large amount of well-annotated data, which can be challenging to acquire in medical imaging.  Several techniques have been investigated in an attempt to overcome insufficient data, including transfer learning, or the application of a pre-trained model to a new domain and/or task.  The successful translation of transfer learning models to complex medical imaging problems holds significant potential and could lead to widespread clinical implementation.

    However, transfer learning techniques often fail to translate effectively because they are limited by the domain in which they were initially trained.  For example, computed tomography (CT) is a powerful medical imaging modality that leverages 3D images in clinical decision-making, but transfer learning models are typically trained on 2D images and thus cannot incorporate the additional information provided by the third dimension.  This use of only part of the available data in a CT scan is inefficient and potentially fails to improve clinical decisions.  In this project, the 3D information available in CT scans is incorporated into transfer learning through a multiple instance learning (MIL) scheme, which individually assesses 2D images and forms a collective 3D prediction from the 2D information, similar to how a radiologist would read a CT scan.  This approach has been applied to evaluate both COVID-19 and emphysema in thoracic CT scans and has demonstrated strong clinical potential.
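    The MIL aggregation step can be sketched as follows (a minimal illustration; the slice scores and the two pooling rules shown are invented, not the project's actual model): a 2D classifier scores each slice, and slice-level scores are pooled into one scan-level prediction.

```python
# Hedged sketch of multiple instance learning over CT slices: per-slice
# probabilities from a (hypothetical) 2D model are pooled into a single
# scan-level probability. Scores and pooling choices are illustrative.
def scan_prediction(slice_scores, pooling="noisy_or"):
    """Combine per-slice disease probabilities into a scan-level probability."""
    if pooling == "max":
        # Scan is as positive as its most suspicious slice.
        return max(slice_scores)
    if pooling == "noisy_or":
        # Soft OR: scan is positive unless every slice is negative,
        # treating slices as independent pieces of evidence.
        prob_all_negative = 1.0
        for p in slice_scores:
            prob_all_negative *= (1.0 - p)
        return 1.0 - prob_all_negative
    raise ValueError(f"unknown pooling rule: {pooling}")

# Invented per-slice scores: one slice looks strongly abnormal.
scores = [0.05, 0.10, 0.85, 0.20]
print(scan_prediction(scores, "max"))       # driven by the 0.85 slice
print(scan_prediction(scores, "noisy_or"))  # slightly higher: weak evidence adds up
```

The appeal of MIL here is that only scan-level labels are needed for training, even though the evidence lives at the slice level.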

    Bio: Jordan Fuhrman is a student in the Graduate Program in Medical Physics at the University of Chicago. Since joining the program after his graduation from the University of Alabama in 2017, Jordan’s research has focused on the investigation of computer-aided diagnosis techniques for evaluating CT scans. Generally, this includes implementation of machine learning, deep learning, and computer vision algorithms to accomplish such tasks as disease detection, image segmentation, and prognosis assessments. His primary research interests lie in the development of novel approaches that incorporate the full wealth of information in CT scans to better inform clinical predictions, the exploration of explainable, interpretable outputs to improve clinical understanding of deep learning algorithm performance, and the early detection and prediction of patient progress to inform clinical decisions (e.g., most appropriate treatment) and improve patient outcomes. His work has largely focused on incidental disease assessment in low-dose CT lung screening scans, including emphysema, osteoporosis, and coronary artery calcifications, but has also included non-screening scan assessments of hypoxic ischemic brain injury and COVID-19. Jordan is a student member of both the American Association of Physicists in Medicine (AAPM) and the Society of Photo-optical Instrumentation Engineers (SPIE).

    Talk Title: How to Preserve Privacy in Data Analysis?

    Talk Abstract: The past decade has witnessed the tremendous success of large-scale data science. However, recent studies show that many existing powerful machine learning tools used in large-scale data science pose severe threats to personal privacy. Therefore, one of the major challenges in data analysis is how to learn effectively from enormous amounts of sensitive data without giving up on privacy. Differential Privacy (DP) has recently emerged as a new gold standard for private data analysis due to the statistical data privacy it can provide for sensitive information. Nevertheless, the adaptation of DP to data analysis remains challenging due to the complex models we often encounter in data analysis. In this talk, I will focus on two commonly used models, i.e., the centralized and distributed/federated models, for differentially private data analysis. For the centralized model, I will present my efforts to provide strong privacy and utility guarantees in high-dimensional data analysis. For the distributed/federated model, I will discuss new efficient and effective privacy-preserving learning algorithms.
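    As a minimal illustration of the DP guarantee mentioned above (the data and parameters below are invented, and this is the classic textbook Laplace mechanism rather than the speaker's methods): adding calibrated noise to a query ensures that any single person's record changes the output distribution by at most a factor of exp(epsilon).

```python
# Minimal sketch of the Laplace mechanism for a counting query.
# Records and epsilon are invented for illustration.
import math
import random

def laplace_mechanism(true_value, sensitivity, epsilon, rng=random.Random(0)):
    """Release true_value + Laplace(sensitivity/epsilon) noise."""
    scale = sensitivity / epsilon
    u = rng.random() - 0.5  # uniform on (-0.5, 0.5)
    # Inverse-CDF sampling of a Laplace(0, scale) variate.
    noise = -scale * (1 if u >= 0 else -1) * math.log(1 - 2 * abs(u))
    return true_value + noise

ages = [34, 29, 41, 50, 38]              # invented sensitive records
true_count = sum(a > 35 for a in ages)   # counting query: sensitivity is 1,
                                         # since one person changes it by at most 1
noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=1.0)
print(round(noisy_count, 2))
```

Smaller epsilon means more noise and stronger privacy; the challenge in the complex centralized and federated models above is obtaining useful accuracy under such guarantees.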

    Bio: Lingxiao Wang is a final year Ph.D. student in the Department of Computer Science at the University of California, Los Angeles, advised by Dr. Quanquan Gu. Previously he obtained his MS degree in Statistics at the University of Washington. Lingxiao’s research interests are broadly in machine learning, including privacy-preserving machine learning, optimization, deep learning, low-rank matrix recovery, high-dimensional statistics, and data mining. Lingxiao aims to apply his research for social good, and he is one of the core members of the Combating COVID-19 project (https://covid19.uclaml.org/).

    Talk Title: Systematic Evaluation of Privacy Risks of Machine Learning Models

    Talk Abstract: Machine learning models are prone to memorizing sensitive data, making them vulnerable to membership inference attacks in which an adversary aims to guess if an input sample was used to train the model. In this talk, we show that prior work on membership inference attacks may severely underestimate the privacy risks by relying solely on training custom neural network classifiers to perform attacks and focusing only on aggregate results over data samples, such as the attack accuracy.

    To overcome these limitations, we first propose to benchmark membership inference privacy risks by improving existing non-neural network based inference attacks and proposing a new inference attack method based on a modification of prediction entropy. Using our benchmark attacks, we demonstrate that existing membership inference defense approaches are not as effective as previously reported.
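    For concreteness, a label-aware "modified prediction entropy" can be sketched as follows. This is a minimal sketch of one published formulation; the exact formula and the thresholding strategy used in the actual attack may differ.

```python
import math

def modified_entropy(probs, label, eps=1e-12):
    """Label-aware entropy score for membership inference.

    Unlike standard prediction entropy, a confidently *wrong* prediction
    scores high rather than low, so low scores more reliably indicate
    training members.
    """
    score = -(1.0 - probs[label]) * math.log(probs[label] + eps)
    for i, p in enumerate(probs):
        if i != label:
            score -= p * math.log(1.0 - p + eps)
    return score
```

    An attacker would predict "member" when the score falls below a threshold calibrated per class; the key point is that the score separates confident-correct from confident-incorrect predictions, which plain entropy cannot.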

    Next, we introduce a new approach for fine-grained privacy analysis by formulating and deriving a new metric called the privacy risk score. Our privacy risk score metric measures an individual sample’s likelihood of being a training member, which allows an adversary to perform membership inference attacks with high confidence. We experimentally validate the effectiveness of the privacy risk score metric and demonstrate that the distribution of privacy risk scores across individual samples is heterogeneous. Our work emphasizes the importance of a systematic and rigorous evaluation of privacy risks of machine learning models.

    Bio: Liwei Song is a fifth-year PhD student in the Department of Electrical Engineering at Princeton University, advised by Prof. Prateek Mittal. Before coming to Princeton, he received his Bachelor’s degree in Electrical Engineering from Peking University.

    His current research focus is on investigating security and privacy issues of machine learning models, including membership inference attacks, evasion attacks, and backdoor attacks. His evaluation methods on membership inference have been integrated into Google’s TensorFlow Privacy library. Besides that, he has also worked on attacking voice assistants with ultrasound, which received widespread media coverage, including from BBC News and The New York Times.

    Talk Title: Reasoning about Social Dynamics and Social Bias in Language

    Talk Abstract: Humans easily make inferences to reason about the social and power dynamics of situations (e.g., stories about everyday interactions), but such reasoning is still a challenge for modern NLP systems. In this talk, I will address how we can make machines reason about social commonsense and social biases in text, and how this reasoning could be applied in downstream applications.

    In the first part, I will discuss PowerTransformer, our new unsupervised model for controllable debiasing of text through the lens of connotation frames of power and agency. Trained using a combined reconstruction and paraphrasing objective, this model can rewrite story sentences such that its characters are portrayed with more agency and decisiveness. After establishing its performance through automatic and human evaluations, we show how PowerTransformer can be used to mitigate gender bias in portrayals of movie characters. Then, I will introduce Social Bias Frames, a conceptual formalism that models the pragmatic frames in which people project social biases and stereotypes onto others to reason about biased or harmful implications in language. Using a new corpus of 150k structured annotations, we show that models can learn to reason about high-level offensiveness of statements, but struggle to explain why a statement might be harmful. I will conclude with future directions for better reasoning about social dynamics and social biases.

    Bio: Maarten Sap is a final year PhD student in the University of Washington’s natural language processing (NLP) group, advised by Noah Smith and Yejin Choi. His research focuses on endowing NLP systems with social intelligence and social commonsense, and understanding social inequality and bias in language. In the past, he’s interned at AI2 on project Mosaic working on social commonsense reasoning, and at Microsoft Research working on long-term memory and storytelling with Eric Horvitz.

    Talk Title: Formal Logic Enhanced Deep Learning for Cyber-Physical Systems

    Talk Abstract: Deep Neural Networks (DNNs) are broadly applied, with outstanding results, for prediction and decision-making support in Cyber-Physical Systems (CPS). However, for large-scale, complex, integrated CPS with high uncertainty, DNN models are not always robust: they are susceptible to anomalies and erroneous predictions, especially when predictions are projected into the future, where uncertainty and error grow over time. To increase the robustness of DNNs for CPS, I developed a novel formal-logic-enhanced learning framework that uses logic-based criteria to make DNN models follow system-critical properties and to build well-calibrated uncertainty estimation models. Trained end-to-end with back-propagation, the framework is general and can be applied to various DNN models. Evaluation results on large-scale real-world city datasets show that my work not only improves prediction accuracy and the effectiveness of uncertainty estimation, but, importantly, also guarantees the satisfaction of model properties and increases the robustness of DNNs. This work can be applied to a wide spectrum of applications, including the Internet of Things, smart cities, healthcare, and many others.
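    The general flavor of a logic-enhanced loss can be sketched as a task loss plus a penalty that is zero exactly when a property holds. The box-constraint property and all names below are hypothetical stand-ins; the actual framework derives its criteria from formal logic specifications.

```python
def mse(preds, targets):
    """Plain mean-squared-error task loss."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

def property_penalty(preds, lower, upper):
    """Zero when every prediction satisfies the (hypothetical) safety
    property lower <= y_hat <= upper; grows with the amount of violation."""
    total = 0.0
    for p in preds:
        total += max(0.0, lower - p) + max(0.0, p - upper)
    return total / len(preds)

def logic_enhanced_loss(preds, targets, lower, upper, lam=1.0):
    """Task loss plus a logic-violation term, trainable end-to-end since
    both pieces are (sub)differentiable in the predictions."""
    return mse(preds, targets) + lam * property_penalty(preds, lower, upper)
```

    With `lam` large enough, the optimizer is pushed toward predictions that satisfy the stated property, which is the sense in which the logic term shapes what the network learns.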

    Bio: Meiyi Ma is a Ph.D. candidate in the Department of Computer Science at the University of Virginia, working with Prof. John A. Stankovic and Prof. Lu Feng. Her research interest lies at the intersection of machine learning, formal methods, and cyber-physical systems. Specifically, her work integrates formal methods and machine learning, and applies new integrative solutions to build safe and reliable integrated cyber-physical systems, with a focus on smart city and healthcare applications. Meiyi’s research has been published in top-tier machine learning and cyber-physical systems conferences and journals, including NeurIPS, ACM TCPS, ICCPS, PerCom, etc. She has received multiple awards, including the EECS Rising Star at UC Berkeley, the Outstanding Graduate Research Award at the University of Virginia, and the Best Master Thesis Award. She is serving as the information director for ACM Transactions on Computing for Healthcare and a reviewer for multiple conferences and journals. She has also served on organizing committees for several international workshops.

    Talk Title: Human-AI Collaborative Decision Making on Rehabilitation Assessment

    Talk Abstract: Rehabilitation monitoring systems with sensors and artificial intelligence (AI) provide an opportunity to improve current rehabilitation practices by automatically collecting quantitative data on a patient’s status. However, adoption of these systems remains a challenge. This talk presents an interactive AI-based system that supports collaborative decision making with therapists for rehabilitation assessment. The system automatically identifies salient features of assessment to generate patient-specific analysis for therapists, and tunes itself with their feedback. In two evaluations with therapists, we found that our system enables significantly higher agreement on assessment (0.71 average F1-score) than a traditional system without such analysis (0.66 average F1-score, p < 0.05). In addition, after tuning with therapists’ feedback, our system significantly improves its performance (from 0.8377 to 0.9116 average F1-score, p < 0.01). This work discusses the potential of a human-AI collaborative system that supports more accurate decision making while each party learns from the other’s strengths.

    Bio: Min Lee is a PhD student at Carnegie Mellon University. His research interests lie at the intersection of human-computer interaction (HCI) and machine learning (ML), where he designs, develops, and evaluates human-centered ML systems to address societal problems. His thesis focuses on creating interactive hybrid intelligence systems to improve the practices of stroke rehabilitation (e.g. a decision support system for therapists and a robotic coaching system for post-stroke survivors).

    Talk Title: Mathematical Models of Brain Connectivity and Behavior: Network Optimization Perspectives, Deep-Generative Hybrids, and Beyond

    Talk Abstract: Autism Spectrum Disorder (ASD) is a complex neurodevelopmental disorder characterized by multiple impairments and levels of disability that vary widely across the ASD spectrum. Currently, quantifying symptom severity relies almost solely on a trained clinician’s evaluation. Recently, neuroimaging studies, for example using resting state functional MRI (rs-fMRI) and Diffusion Tensor Imaging (DTI), have been gaining popularity for studying brain dysfunction. My work aims at linking the symptomatic characterization of ASD with the functional and structural organization of a patient’s brain via machine learning. To set the stage, I will first introduce a joint network optimization to predict clinical severity from rs-fMRI data. Our model couples two terms in a joint optimization: a generative matrix factorization and a discriminative regression. Next, we extend this to a deep-generative hybrid that jointly models the complementarity between structure (DTI) and functional dynamics (dynamic rs-fMRI connectivity) to extract predictive disease biomarkers. The generative part of our framework is now a structurally-regularized matrix factorization on dynamic rs-fMRI correlation matrices, guided by DTI tractography to learn anatomically informed connectivity profiles. The deep part of our framework is an LSTM-ANN, which models the temporal evolution of the scan to map to behavior. Our main novelty lies in our coupled optimization, which collectively estimates the matrix factors and the neural network weights. We outperform several state-of-the-art baselines to extract multi-modal neural signatures of brain dysfunction. Finally, I will present our current exploration based on graph neural networks and manifold learning to better capture the underlying data geometry.

    Bio: Niharika is a PhD candidate in the Department of Electrical and Computer Engineering at Johns Hopkins University. Her research interests lie at the intersection of deep learning, non-convex optimization, manifold learning, and graph signal processing applied to neuroimaging data. She has developed novel machine learning algorithms that predict behavioral deficits in patients with Autism by decoding their brain organization from their functional and structural neuroimaging scans. Prior to joining Hopkins, she obtained a bachelor’s degree (B. Tech with Hons.) in Electrical Engineering with a minor in Electronics and Electrical Communications Engineering from the Indian Institute of Technology, Kharagpur.

    Talk Title: Power Outage Risk Interconnection: Relationship with Social and Environmental Critical Risk Indicators

    Talk Abstract: The interconnections between diverse components in a system can provide profound insights on the health and risk states of the system as a whole. Highly interconnected systems tend to accumulate risks until a large, systemic crisis hits. For example, in the 2007-09 financial crisis, the interconnection of financial institutions heightened near the collapse, suggesting the system could no longer absorb risks. Extending concepts of interconnectedness and systemic risk to coupled human-natural systems, one might expect similar behaviours of risk accumulation and heightened connectivity, leading to potential system failures. The Predictive Risk Investigation System (PRISM) for Multi-layer Dynamic Interconnection Analysis aims to explore the complex interconnectedness and systemic risks in human-natural systems.

    Applying the PRISM approach, we can uncover dynamic relationships and trends in climate resilience and preparedness using energy, environmental, and social indicators. This study proposes a case-study application of the PRISM approach to the State of Massachusetts using a dataset of over 130,000 power outages in the state from 2013 to 2018. Random forests, Locally Weighted Scatterplot Smoothing (LOWESS), and Generalized Additive Models (GAMs) are applied to understand the interconnections between power outages, population density, and environmental factors (weather indicators such as wind speed and precipitation).

    Bio: I am a data scientist with domain expertise in energy – oil, gas, renewables, and power systems. With a BS in Petroleum Engineering and an MS in Sustainable Energy Systems, I have always enjoyed a data-centric approach to solving interdisciplinary problems. In my bachelor’s degree, I used neural networks to solve a practical oil-field (production engineering) problem. In my master’s, I explored the potential for optimizing clean-energy microgrids in low-income, underserved communities while leveraging insights from large, messy, unstructured data. In my PhD at Tufts, I am working in an interdisciplinary team of data and domain scientists, applying data science and machine learning techniques and tools to energy, climate, financial, and ecological systems. One word to describe my experience is diversity. I am fortunate to have enjoyed a fair share of diversity in my academic and professional experience – in geography and in scope – spanning three continents and a broad scientific and engineering background. This exemplifies my interest in complex, interdisciplinary, and multifaceted problems across science, engineering, and data science. I am enthusiastic about applying my knowledge and skills in data science to new, challenging, unfamiliar terrains to discover and garner insights and to solve problems that improve experiences for people, communities, and organizations.

    Talk Title: Data-Efficient Optimization in Reinforcement Learning

    Talk Abstract: Optimization lies at the heart of modern machine learning and data science research. How to design data-efficient optimization algorithms that have low sample complexity while enjoying fast convergence remains a challenging but imperative topic in machine learning. My research aims to answer this question from two facets: providing the theoretical analysis and understanding of optimization algorithms, and developing new algorithms with strong empirical performance in a principled way. In this talk, I will introduce our recent work on developing and improving data-efficient optimization algorithms for decision-making (reinforcement learning) problems. In particular, I will introduce the variance reduction technique in optimization and show how it can improve the data efficiency of policy gradient methods in reinforcement learning. I will present the variance reduced policy gradient algorithm, which constructs an unbiased policy gradient estimator for the value function. I will show that it provably reduces the sample complexity of vanilla policy gradient methods such as REINFORCE and GPOMDP.
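    Why variance matters for policy gradients can be illustrated on a toy two-arm bandit with a sigmoid policy, where subtracting a baseline leaves the REINFORCE estimator unbiased but shrinks its variance. This sketch uses a simple baseline rather than the recursive variance reduction of the algorithm in the talk, and all names are hypothetical.

```python
import math
import random

def reinforce_grads(theta, rewards, n, rng, baseline=0.0):
    """Per-sample REINFORCE gradient estimates for a two-arm bandit with
    pi(a=1) = sigmoid(theta); here grad log pi(a) = a - pi(1). Subtracting
    a constant baseline does not change the estimator's expectation."""
    p1 = 1.0 / (1.0 + math.exp(-theta))
    grads = []
    for _ in range(n):
        a = 1 if rng.random() < p1 else 0
        r = rewards[a] + rng.gauss(0.0, 0.1)  # noisy observed reward
        grads.append((r - baseline) * (a - p1))
    return grads

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)
```

    With rewards around 10, the raw estimator's variance is dominated by the reward magnitude, while centering with a baseline near the mean reward leaves only the informative difference between the arms.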

    Bio: Pan Xu is a Ph.D. candidate in the Department of Computer Science at the University of California, Los Angeles. His research spans the areas of machine learning, data science, and optimization, with a focus on the development and improvement of large-scale nonconvex optimization algorithms for machine learning and data science applications. Pan obtained his B.S. degree in mathematics from the University of Science and Technology of China. Pan received the Presidential Fellowship in Data Science from the University of Virginia. He has published over 20 papers in top machine learning conferences and journals such as ICML, NeurIPS, ICLR, AISTATS, and JMLR.

    Talk Title: Efficient Neural Question Answering for Heterogeneous Platforms

    Talk Abstract: Natural language processing (NLP) systems power many real-world applications such as Alexa, Siri, and the Google and Bing search engines. Deep learning NLP systems are becoming more effective thanks to increasingly large models with many layers and millions to billions of parameters, but these systems are challenging to deploy: they are compute-intensive, consume considerable energy, and cannot run on mobile devices. In this talk, I will present two works on optimizing the efficiency of question answering (QA) systems, and my current research on the energy consumption of large NLP models. First, I will introduce DeQA, which provides on-device question answering to help mobile users find information more efficiently and without privacy concerns. Deep learning based QA systems are otherwise slow and unusable on mobile devices; we design latency and memory optimizations, widely applicable to state-of-the-art QA systems, that let them run locally. Second, I will present DeFormer, a simple decomposition-based technique that takes pre-trained Transformer models and modifies them to enable faster QA inference for both the cloud and mobile. Lastly, I will introduce how we can accurately measure the energy consumption of NLP models using hardware power meters and build reliable energy estimation models by abstracting meaningful features of NLP workloads and profiling runtime resource usage.
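    The core idea behind a decomposition like DeFormer can be sketched with toy stand-ins: if the lower layers process question and passage independently, passage encodings can be precomputed and cached, and only the upper, jointly-attending layers run per query. The functions below are deliberately trivial placeholders, not the model's actual layers.

```python
def lower_layers(text):
    """Stand-in for the first k Transformer layers applied to one segment
    in isolation (a real model would return contextual embeddings)."""
    return [ord(c) % 7 for c in text]

def upper_layers(q_repr, p_repr):
    """Stand-in for the remaining layers, which attend across segments."""
    return sum(q_repr) + sum(p_repr)

passage_cache = {}

def answer(question, passage):
    """Decomposition-style inference: the passage's lower-layer encoding is
    computed once and reused, so repeated queries over the same passage
    skip most of the work."""
    if passage not in passage_cache:
        passage_cache[passage] = lower_layers(passage)
    return upper_layers(lower_layers(question), passage_cache[passage])
```

    The speed-up comes from the design choice of delaying cross-attention: everything computed before the segments interact is query-independent and therefore cacheable.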

    Bio: Qingqing Cao is a graduating Computer Science Ph.D. candidate at Stony Brook University. His research interests include natural language processing (NLP), mobile computing, and machine learning systems. He has focused on building efficient and practical NLP systems for both edge devices and the cloud, such as on-device question answering (MobiSys 2019), faster Transformer models (ACL 2020), and accurate energy estimation of NLP models. He has two fantastic advisors: Prof. Aruna Balasubramanian and Prof. Niranjan Balasubramanian. He is looking for postdoc openings in academia or research positions in the industry.

    Talk Title: Artificial Intelligence for Medical Image Analysis for Breast Cancer Multiparametric MRI

    Talk Abstract: Artificial intelligence is playing an increasingly important role in medical imaging. Computer-aided diagnosis (CADx) systems using human-engineered features or deep learning can potentially assist radiologists in image interpretation by extracting quantitative biomarkers to improve diagnostic performance and circumvent unnecessary invasive procedures. Multiparametric MRI (mpMRI) has become a part of routine clinical assessment for screening of high-risk patients for breast cancer and monitoring therapy response because it has been shown to improve diagnostic accuracy. Current CADx methods for breast lesion assessment on MRI, however, are mostly focused on one sequence, the dynamic contrast-enhanced (DCE)-MRI. Therefore, we investigated methods for incorporating three sequences in mpMRI to improve the CADx performance in differentiating benign and malignant breast lesions. We compared integrating the mpMRI information at the image level, feature level, or classifier output level. In addition, transfer learning is often employed in deep learning applications in medical imaging due to data scarcity. However, pretrained convolutional neural networks (CNNs) used in transfer learning require two-dimensional (2D) inputs, limiting the ability to utilize high-dimensional information in medical imaging. To address this problem, we investigated a transfer learning method that collapses volumetric information to 2D by taking the maximum intensity projection (MIP) at the feature level within CNNs, which outperformed a previous method of using MIPs of images themselves in the task of distinguishing between benign and malignant breast lesions. We proposed a method that combines feature fusion and feature MIP for computer-aided breast cancer diagnosis using high-dimensional mpMRI that outperforms the current benchmarks.
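    The feature-level MIP idea reduces to taking an element-wise maximum across the depth axis of a volumetric stack of feature maps, so that a 2D network input (or 2D features) retains the strongest response at each spatial location. This is a minimal nested-list sketch under that reading, not the actual CNN implementation.

```python
def feature_mip(volume):
    """Collapse a depth x H x W stack of feature maps (nested lists) into a
    single H x W map by taking the maximum across the depth axis, i.e. a
    maximum intensity projection applied at the feature level."""
    depth = len(volume)
    rows, cols = len(volume[0]), len(volume[0][0])
    return [[max(volume[d][r][c] for d in range(depth)) for c in range(cols)]
            for r in range(rows)]
```

    Projecting at the feature level rather than on raw images lets each slice contribute learned responses before the collapse, which is the motivation the abstract gives for the improved performance.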

    Bio: Isabelle is a PhD candidate in Medical Physics at the University of Chicago, supervised by Dr. Maryellen Giger. Her research is centered around developing automated methods for quantitative medical image analysis to assist in clinical decision-making. She has proposed novel methodologies to diagnose breast cancer using multiparametric MRI exams. Since the pandemic, she has also been working on AI solutions that leverage medical images to enhance the early detection and prognosis of COVID-19. She has first-hand experience tackling unique challenges faced by medical imaging applications of machine learning due to high dimensionality, data scarcity, noisy labels, etc. She loves working at the intersection of physics, medicine, and data science, and she is motivated by the profound potential impact that her research can bring on improving access to high-quality care and providing a proactive healthcare system. She hopes to dedicate her career to building AI-empowered technology to transform healthcare, accelerate scientific discoveries, and improve human well-being.

    Talk Title: Asymptotically Optimal Exact Minibatch Metropolis-Hastings

    Talk Abstract: Metropolis-Hastings (MH) is one of the most fundamental Bayesian inference algorithms, but it can be intractable on large datasets due to requiring computations over the whole dataset. In this talk, I will discuss minibatch MH methods, which use subsamples to enable scaling. First, I will talk about existing minibatch MH methods, and demonstrate that inexact methods (i.e. they may change the target distribution) can cause arbitrarily large errors in inference. Then, I will introduce a new exact minibatch MH method, TunaMH, which exposes a tunable trade-off between its batch size and its theoretically guaranteed convergence rate. Finally, I will present a lower bound on the batch size that any minibatch MH method must use to retain exactness while guaranteeing fast convergence—the first such bound for minibatch MH—and show TunaMH is asymptotically optimal in terms of the batch size.
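    For reference, here is a minimal random-walk Metropolis-Hastings sampler. Note that each step evaluates the full log-target once; when that log-density sums over a large dataset, this full-data pass is exactly the per-step cost that minibatch MH methods such as TunaMH aim to avoid (this sketch is the vanilla algorithm, not TunaMH).

```python
import math
import random

def metropolis_hastings(log_target, x0, n_steps, step, rng):
    """Random-walk MH: propose y ~ N(x, step^2) and accept with probability
    min(1, target(y) / target(x)). The symmetric proposal cancels in the
    acceptance ratio, leaving only the target-density evaluations."""
    x, lx = x0, log_target(x0)
    samples = []
    for _ in range(n_steps):
        y = x + rng.gauss(0.0, step)
        ly = log_target(y)
        if rng.random() < math.exp(min(0.0, ly - lx)):
            x, lx = y, ly  # accept the proposal
        samples.append(x)  # on rejection the chain repeats x
    return samples
```

    An exact minibatch method must accept/reject using only a data subsample while still leaving the target distribution invariant, which is why the batch-size versus convergence-rate trade-off in the talk is nontrivial.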

    Bio: Ruqi Zhang is a fifth-year Ph.D. student in Statistics at Cornell University, advised by Professor Chris De Sa. Her research interests lie in probabilistic modeling for data science and machine learning. She currently focuses on developing fast and robust inference methods with theoretical guarantees and their applications with modern model architectures, such as deep neural networks, on real-world big data. Her work has been published in top machine learning venues such as NeurIPS, ICLR and AISTATS, and has been recognized through an Oral Award at ICLR and two Spotlight Awards at NeurIPS.

    Talk Title: Towards Global-Scale Biodiversity Monitoring – Scaling Geospatial and Taxonomic Coverage Using Contextual Clues

    Talk Abstract: Biodiversity is declining globally at unprecedented rates. We need to monitor species in real time and in greater detail to quickly understand which conservation efforts are most effective and take corrective action. Current ecological monitoring systems generate data far faster than researchers can analyze it, making scaling up impossible without automated data processing. However, ecological data collected in the field presents a number of challenges that current methods, like deep learning, are not designed to tackle. Biodiversity data is correlated in time and space, resulting in overfitting and poor generalization to new sensor deployments. Environmental monitoring sensors have limited intelligence, resulting in objects of interest that are often too close or too far, blurry, or in clutter. Further, the distribution of species is long-tailed, which results in highly imbalanced datasets. These challenges are not unique to the natural world; advances in any one of these areas will have far-reaching impact across domains. To address these challenges, we take inspiration from the value of additional contextual information for human experts, and seek to incorporate it within the structure of machine learning systems. Incorporating species distributions, along with access at inference time to data collected within a sensor, can improve generalization to new sensors without additional human data labeling. Going beyond single-sensor deployment, there is a large degree of contextual information shared across multiple data streams. Our long-term goal is to develop learning methods that efficiently and adaptively benefit from many different data streams on a global scale.

    Bio: Sara Beery has always been passionate about the natural world, and she saw a need for technology-based approaches to conservation and sustainability challenges. This led her to pursue a PhD at Caltech, where she is advised by Pietro Perona and funded by an NSF Graduate Research Fellowship, a PIMCO Fellowship in Data Science, and an Amazon/Caltech AI4Science Fellowship. Her research focuses on computer vision for global-scale biodiversity monitoring. She works closely with Microsoft AI for Earth and Google Research to translate her work into usable tools, including widely used models and benchmarks for detection and recognition of animal species in challenging camera trap data at a global scale. She has worked to bridge the interdisciplinary gap between ecology and computer science by hosting the iWildCam challenge at the FGVC Workshop at CVPR from 2018 to 2021, and through founding and managing a highly successful AI for Conservation Slack channel which provides a meeting point for experts from each community to discuss new methods and best practices for conservation technology. Sara’s prior experience as a professional ballerina and a nontraditional student has taught her the value of unique and diverse perspectives in the research community. She’s passionate about increasing diversity and inclusion in STEM through mentorship and outreach.

    Talk Title: Promoting Worker Performance with Human-Centered Data Science

    Talk Abstract: Addressing real-world problems about human behavior is one of the main ways that advances in data science techniques and social science theories achieve social impact. To approach these problems, we propose a human-centered data science framework that synergizes strengths across machine learning, causal inference, field experiments, and social science theories to understand, predict, and intervene in human behavior. In this talk, I will present three empirical studies that promote worker performance with human-centered data science. In the first project, we work with the New York City Mayor’s Office and deploy explainable machine learning models to predict the risk of tenant harassment in New York City. In the second project, we leverage insights from social identity theory and conduct a large-scale field experiment on DiDi, a leading ride-sharing platform, showing that the intervention of bonus-free team ranking/contest systems can improve driver engagement. Third, to further unpack the effect of team contests on individual DiDi drivers, we bring together causal inference, machine learning, and social science theories to predict individual treatment effects. Insights from this study are directionally actionable for improving team recommender systems and contest design. I will conclude with promising future directions that showcase the effectiveness and flexibility of this framework.

    Bio: I am a final-year Ph.D. candidate at the School of Information, University of Michigan, Ann Arbor, working with Professor Qiaozhu Mei. My research focuses on human-centered data science, where I couple data science techniques and social science theories to address real-world problems by understanding, predicting, and intervening in human behavior.

    Specifically, I synergize strengths across machine learning, causal inference, field experiments, and social science theories to solve practical problems in the areas of data science for social good, the sharing economy, crowdsourcing, crowdfunding, social media, and health. For example, we have collaborated with the New York City Mayor’s Office and helped to prioritize government outreach to tenants vulnerable to landlord harassment in New York City by deploying machine learning models. In collaboration with Didi Chuxing, a leading ride-sharing platform, we have leveraged field experiments and machine learning models to enhance driver engagement and intervention design. The results of my work have been integrated into real-world products involving millions of users and have been published across data mining, social computing, and human-computer interaction venues.

    Talk Title: PAPRIKA: Private Online False Discovery Rate Control

    Talk Abstract: In hypothesis testing, a false discovery occurs when a hypothesis is incorrectly rejected due to noise in the sample. When adaptively testing multiple hypotheses, the probability of a false discovery increases as more tests are performed. Thus the problem of False Discovery Rate (FDR) control is to find a procedure for testing multiple hypotheses that accounts for this effect in determining the set of hypotheses to reject. The goal is to minimize the number (or fraction) of false discoveries, while maintaining a high true positive rate (i.e., correct discoveries).
    In this work, we study False Discovery Rate (FDR) control in multiple hypothesis testing under the constraint of differential privacy for the sample. Unlike previous work in this direction, we focus on the online setting, meaning that a decision about each hypothesis must be made immediately after the test is performed, rather than waiting for the output of all tests as in the offline setting. We provide new private algorithms based on state-of-the-art results in non-private online FDR control. Our algorithms have strong provable guarantees for privacy and statistical performance as measured by FDR and power. We also provide experimental results to demonstrate the efficacy of our algorithms in a variety of data environments.
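    The online constraint can be made concrete with a much simpler (and much more conservative) non-private baseline: alpha-spending, which allocates a summable sequence of test levels in advance. This sketch controls even the family-wise error rate and is not the paper's algorithm; state-of-the-art procedures like LORD recycle alpha-"wealth" after each discovery to gain power, and PAPRIKA additionally privatizes the decisions.

```python
import math

def online_alpha_spending(p_values, alpha=0.05):
    """Test hypotheses one at a time, rejecting the t-th (1-indexed) iff
    p_t <= alpha * 6 / (pi^2 * t^2). The per-test levels sum to at most
    alpha, so error is controlled no matter how many tests arrive."""
    decisions = []
    for t, p in enumerate(p_values, start=1):
        level = alpha * 6.0 / (math.pi ** 2 * t ** 2)
        decisions.append(p <= level)  # decision is final: online setting
    return decisions
```

    Because each decision is emitted before the next p-value is seen, nothing can be revisited later, which is exactly what distinguishes the online setting from offline procedures like Benjamini-Hochberg.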

    Bio: Wanrong Zhang is a PhD candidate at Georgia Tech supervised by Rachel Cummings and Yajun Mei. Her research interests lie primarily in data privacy, with connections to statistics and machine learning. Her research focuses on designing privacy-preserving algorithms for machine learning models and statistical analysis tools, as well as identifying and preventing privacy vulnerabilities in modern collaborative learning. Before joining Georgia Tech, she received her B.S. in Statistics from Peking University.

    Talk Title: Towards Better Informed Extraction of Events from Documents

    Talk Abstract: Large amounts of text are written and published online daily. As a result, applications that automatically read documents to extract useful information and answer user questions have become increasingly important for efficient absorption of information. In this talk, I will focus on the problem of finding and organizing information about events, and introduce my recent research on document-level event extraction. First, I will briefly summarize the high-level goal and several key challenges (including modeling context and better leveraging background knowledge), as well as my efforts to tackle them. Then I will focus on work in which we formulate event extraction as a question answering problem — both to access relevant knowledge encoded in large models and to reduce the cost of human annotation required for training data creation.

    Bio: Xinya Du is a Ph.D. candidate in the Computer Science Department at Cornell University, advised by Prof. Claire Cardie. He received a bachelor’s degree in Computer Science from Shanghai Jiao Tong University. His research is on natural language processing, especially methods that enable learning with fewer annotations for document-level information extraction. His work has been published in leading NLP conferences and has been covered by New Scientist and TechRepublic.

    Talk Title: Understanding Success and Failure in Science and Technology

    Talk Abstract: Society in the 21st century is largely driven by science and innovation, but our quantitative understanding of why, how, and when innovators and innovations succeed or fail remains limited. Despite long-standing interest in this topic, current science of science research relies on citation and publication records as its major data sources. Yet science functions as a complex system that is much more than published papers, and ignoring this multidimensional nature precludes a deeper examination of many fundamental elements of innovation lifecycles, from failure to scientific breakthrough, from public funding to broad impact. In this talk, I will touch on a few examples of success and failure across science and technology, hoping to illustrate a path toward a better understanding of the full innovation lifecycle. By combining various large-scale datasets and interdisciplinary analytical frameworks rooted in data mining, statistical physics, and computational social science, we discover a series of fundamental mechanisms and signals underlying the processes by which (1) individuals and organizations build on repeated failures toward ultimate victory or defeat in science, startups, and security; (2) scientific elites produce breakthrough discoveries in their careers; and (3) scientific research gets funded and used by the general public. The patterns uncovered in these studies not only reveal regularity and predictability underlying often-noisy social systems, they also offer a new theoretical and empirical basis that is practically relevant for individual scientists, research institutes, and innovation policymakers.

    Bio: Yian Yin is a Ph.D. candidate in Industrial Engineering & Management Sciences at Northwestern University, advised by Dashun Wang and Noshir Contractor. He also holds affiliations with the Northwestern Institute on Complex Systems and the Center for Science of Science and Innovation. Prior to joining Northwestern, he received his bachelor's degrees in Statistics and Economics from Peking University in 2016.

    Yian studies computational social science, with a particular focus on integrating theoretical insights in innovation studies, computational tools in data science, and modeling frameworks in complex systems to examine various fundamental elements of innovation lifecycles, from dynamics of failure to emergence of scientific breakthrough, from public funding for science to broad uses of science in public domains. His research has been published in multidisciplinary journals including Science, Nature, Nature Human Behaviour, and Nature Reviews Physics, and has been featured in Science, Lancet, Forbes, Washington Post, Scientific American, Harvard Business Review, MIT Technology Review, among other outlets.

    Talk Title: Towards Interpretable Machine Learning by Human Knowledge Reasoning

    Talk Abstract: Despite the great success of statistical learning theories in building intelligent systems, a long-standing challenge of artificial intelligence remains: bridging the gaps between what machines know, what humans think machines know, and what humans know about the real world. By doing so, we can first ground machines' prior knowledge in human knowledge and then perform explicit reasoning for various downstream tasks, yielding more interpretable machine learning.

    In this talk, I will briefly present two pieces of my existing work that leverage human expert and commonsense knowledge reasoning to increase the interpretability and transparency of machine learning models in natural language processing. First, I will show how existing cognitive theories of human memory can inspire an interpretable framework for rationalizing the medical relation prediction task based on expert knowledge. Second, I will introduce how we can learn better word representations based on commonsense knowledge and reasoning. Our proposed framework learns a commonsense reasoning module guided by a self-supervision task and provides word-pair and single-word representations distilled from the learned reasoning modules. Both works offer reasoning paths to justify their decisions and boost model interpretability so that humans can understand the models with minimal knowledge barriers.

    Bio: Zhen Wang is a Ph.D. student in the Department of Computer Science and Engineering at the Ohio State University, advised by Prof. Huan Sun. His research centers on natural language processing, data mining, and machine learning, with emphasis on information extraction, question answering, graph learning, text understanding, and interpretable machine learning. In particular, he is interested in improving the trustworthiness and generalizability of data-driven machine learning models through interpretable and robust knowledge representation and reasoning. He has published papers in several top-tier data science conferences, such as KDD, ACL, and WSDM, as well as journals like Bioinformatics. He conducts interdisciplinary research that connects artificial intelligence with cognitive neuroscience, linguistics, software engineering, and medical informatics.

  • Speakers + Panelists

    Workshop Dates: January 11-12th, 2021 | Workshop Agenda

    Opening Remarks

    Angela V. Olinto is Dean of the Division of the Physical Sciences and the Albert A. Michelson Distinguished Service Professor in the Department of Astronomy and Astrophysics, the Kavli Institute for Cosmological Physics, and the Enrico Fermi Institute at the University of Chicago. She previously served as Chair of the Department of Astronomy and Astrophysics from 2003 to 2006 and again from 2012 to 2017.

    Olinto is best known for her contributions to the study of the structure of neutron stars, primordial inflationary theory, cosmic magnetic fields, the nature of the dark matter, and the origin of the highest energy cosmic rays, gamma-rays, and neutrinos. She is the Principal Investigator of the POEMMA (Probe Of Extreme Multi-Messenger Astrophysics) space mission and the EUSO (Extreme Universe Space Observatory) on a super pressure balloon (SPB) mission, and a member of the Pierre Auger Observatory, all designed to discover the origin of the highest energy cosmic particles, their sources, and their interactions.

    Olinto received a B.S. in Physics from the Pontifícia Universidade Católica of Rio de Janeiro, Brazil in 1981, and a Ph.D. in Physics from the Massachusetts Institute of Technology in 1987. She is a fellow of the American Physical Society and of the American Association for the Advancement of Science, was a trustee of the Aspen Center for Physics, and has served on many advisory committees for the National Academy of Sciences, Department of Energy, National Science Foundation, and the National Aeronautics and Space Administration. She received the Chaire d’Excellence Award of the French Agence Nationale de Recherche in 2006, the Llewellyn John and Harriet Manchester Quantrell Award for Excellence in Undergraduate Teaching in 2011, and the Faculty Award for Excellence in Graduate Teaching in 2015 at the University of Chicago.

    Nick Feamster is Neubauer Professor in the Department of Computer Science and the College. He researches computer networking and networked systems, with a particular interest in Internet censorship, privacy, and the Internet of Things. His work on experimental networked systems and security aims to make networks easier to manage, more secure, and more available.

    Homepage

    Fireside Chat (1/11)

    Rebecca Willett is a Professor of Statistics and Computer Science at the University of Chicago. She completed her PhD in Electrical and Computer Engineering at Rice University in 2005 and was an Assistant, then tenured Associate, Professor of Electrical and Computer Engineering at Duke University from 2005 to 2013. She was an Associate Professor of Electrical and Computer Engineering, Harvey D. Spangler Faculty Scholar, and Fellow of the Wisconsin Institutes for Discovery at the University of Wisconsin-Madison from 2013 to 2018. Prof. Willett received the National Science Foundation CAREER Award in 2007, was a member of the DARPA Computer Science Study Group from 2007 to 2011, and received an Air Force Office of Scientific Research Young Investigator Program award in 2010. Prof. Willett has also held visiting researcher positions at the Institute for Pure and Applied Mathematics at UCLA in 2004, the University of Wisconsin-Madison from 2003 to 2005, the French National Institute for Research in Computer Science and Control (INRIA) in 2003, and the Applied Science Research and Development Laboratory at GE Medical Systems (now GE Healthcare) in 2002. Her research interests include network and imaging science with applications in medical imaging, wireless sensor networks, astronomy, and social networks. She is also an instructor for FEMMES (Females Excelling More in Math Engineering and Science) and a local exhibit leader for Sally Ride Festivals. She was a recipient of the National Science Foundation Graduate Research Fellowship, the Rice University Presidential Scholarship, the Society of Women Engineers Caterpillar Scholarship, and the Angier B. Duke Memorial Scholarship.

    Homepage

    Julia Hanson is a Ph.D. student in the Department of Computer Science studying computer security and privacy. This will be Julia’s first summer as a lab coordinator for the CDAC Summer Internship Program.

    Panel: The Future of Data Science (1/11)

    Michael J. Franklin is the inaugural holder of the Liew Family Chair of Computer Science. An authority on databases, data analytics, data management and distributed systems, he also serves as senior advisor to the provost on computation and data science.

    Previously, Franklin was the Thomas M. Siebel Professor of Computer Science and chair of the Computer Science Division of the Department of Electrical Engineering and Computer Sciences at the University of California, Berkeley. There, he co-founded Berkeley’s Algorithms, Machines and People Laboratory (AMPLab), a leading academic big data analytics research center. The AMPLab won a National Science Foundation CISE “Expeditions in Computing” award, which was announced as part of the White House Big Data Research initiative in March 2012, and received support from over 30 industrial sponsors. AMPLab created industry-changing open source Big Data software including Apache Spark and BDAS, the Berkeley Data Analytics Stack. At Berkeley, he also served as an executive committee member for the Berkeley Institute for Data Science, a campus-wide initiative to advance data science environments.

    An energetic entrepreneur in addition to his academic work, Franklin founded and became chief technology officer of Truviso, a data analytics company acquired by Cisco Systems. He serves on the technical advisory boards of various data-driven technology companies and organizations.

    Franklin is a Fellow of the Association for Computing Machinery and a two-time recipient of the ACM SIGMOD (Special Interest Group on Management of Data) “Test of Time” award. His many other honors include the outstanding advisor award from Berkeley’s Computer Science Graduate Student Association. He received the Ph.D. in Computer Science from the University of Wisconsin in 1993, a Master of Software Engineering from the Wang Institute of Graduate Studies in 1986, and the B.S. in Computer and Information Science from the University of Massachusetts in 1983.

    Homepage

    Katherine Baicker, a leading scholar in the economic analysis of health policy, commenced as Dean and the Emmett Dedmon Professor at the University of Chicago Harris School of Public Policy on August 15, 2017.

    Baicker’s research focuses on the effectiveness of public and private health insurance, including the effect of reforms on the distribution and quality of care.  Her large scale research projects include the Oregon Health Insurance Experiment, a randomized evaluation of the effects of Medicaid coverage. Her research has been published in journals such as the New England Journal of Medicine, Science, Health Affairs, JAMA, and the Quarterly Journal of Economics.

    Baicker is an elected member of the National Academy of Medicine (IOM), the National Academy of Social Insurance, the Council on Foreign Relations, and the American Academy of Arts and Sciences.  She holds appointments as a research associate at the National Bureau of Economic Research and as an affiliate of the Abdul Latif Poverty Action Lab.  She serves on the Congressional Budget Office’s Panel of Health Advisers, on the Advisory Board of the National Institute for Health Care Management, as a Trustee of the Mayo Clinic and of NORC, and on the Board of Directors of Eli Lilly and of HMS.

    Before coming to the University of Chicago, Baicker was the C. Boyden Gray Professor of Health Economics in the Department of Health Policy and Management at the Harvard T.H. Chan School of Public Health. She has served as Chair of the Massachusetts Group Insurance Commission; Chair of the Board of Directors of AcademyHealth; Commissioner on the Medicare Payment Advisory Commission; and a nonresident senior fellow of the Brookings Institution. From 2005-2007, she served as a Senate-confirmed Member of the President’s Council of Economic Advisers, where she played a leading role in the development of health policy.

    Baicker earned her B.A. in economics from Yale and her Ph.D. in economics from Harvard.

    My main research interests are in speech and language processing, as well as related aspects of machine learning.

    I am an Associate Professor at TTI-Chicago, a philanthropically endowed academic computer science institute located on the University of Chicago campus. We are recruiting students to our PhD program and visiting student program, as well as additional faculty, including in speech and language-related areas (more on Speech and Language at TTIC).

    I completed my PhD in 2005 at MIT in the Spoken Language Systems group of the Computer Science and Artificial Intelligence Laboratory. In 2005-2007 I was a post-doctoral lecturer in the MIT EECS department. In Feb.-Aug. 2008 I was a Research Assistant Professor at TTI-Chicago.

    Homepage

    Dan Nicolae obtained his Ph.D. in statistics from The University of Chicago and has been a faculty member at the same institution since 1999, with appointments in Statistics (since 1999) and Medicine (since 2006). His research focuses on developing statistical and computational methods for understanding human genetic variation and its influence on the risk for complex traits, with an emphasis on asthma-related phenotypes. The current focus of his statistical genetics research is on data integration and system-level approaches using large datasets that include clinical and environmental data as well as various genetics/genomics data types: DNA variation, gene expression (RNA-seq), methylation, and microbiome.

    Homepage

    David Uminsky joined the University of Chicago in September 2020 as a senior research associate and Executive Director of Data Science. He was previously an associate professor of Mathematics and Executive Director of the Data Institute at University of San Francisco (USF). His research interests are in machine learning, signal processing, pattern formation, and dynamical systems.  David is an associate editor of the Harvard Data Science Review.  He was selected in 2015 by the National Academy of Sciences as a Kavli Frontiers of Science Fellow. He is also the founding Director of the BS in Data Science at USF and served as Director of the MS in Data Science program from 2014-2019. During the summer of 2018, David served as the Director of Research for the Mathematical Science Research Institute Undergrad Program on the topic of Mathematical Data Science.

    Before joining USF he was a combined NSF and UC President’s Fellow at UCLA, where he was awarded the Chancellor’s Award for outstanding postdoctoral research. He holds a Ph.D. in Mathematics from Boston University and a BS in Mathematics from Harvey Mudd College.

    Panel: Interdisciplinary Data Science (1/11)

    Luís M. A. Bettencourt is the Pritzker Director of the Mansueto Institute for Urban Innovation and Professor of Ecology and Evolution at the University of Chicago, as well as an External Professor of Complex Systems at the Santa Fe Institute. He was trained as a theoretical physicist and obtained his undergraduate degree from Instituto Superior Técnico (Lisbon, Portugal) in 1992, and his PhD from Imperial College (University of London, UK) in 1996 for research in statistical and high-energy physics models of the early Universe.  He has held postdoctoral positions at the University of Heidelberg (Germany), Los Alamos National Laboratory (Director’s Fellow and Slansky Fellow) and at MIT (Center for Theoretical Physics). He has worked extensively on complex systems theory and on cities and urbanization, in particular. His research emphasizes the creation of new interdisciplinary synthesis to describe cities in quantitative and predictive ways, informed by classical theory from various disciplines and the growing availability of empirical data worldwide. He is the author of over 100 scientific papers and several edited books. His research has been featured in leading media venues, such as the New York Times, Nature, Wired, New Scientist, and the Smithsonian.

    Homepage

    Marshini Chetty is an assistant professor in the Department of Computer Science at the University of Chicago, where she co-directs the Amyoli Internet Research Lab (AIR lab). She specializes in human-computer interaction, usable privacy and security, and ubiquitous computing. Marshini designs, implements, and evaluates technologies to help users manage different aspects of Internet use, from privacy and security to performance and costs. She often works in resource-constrained settings and uses her work to help inform Internet policy. She has a Ph.D. in Human-Centered Computing from Georgia Institute of Technology, USA and a Master's and Bachelor's in Computer Science from the University of Cape Town, South Africa. In her former lives, Marshini was on the faculty in the Computer Science Department at Princeton University and the College of Information Studies at the University of Maryland, College Park. Her work has won best paper awards at SOUPS, CHI, and CSCW and has been funded by the National Science Foundation, the National Security Agency, Intel, Microsoft, Facebook, and multiple Google Faculty Research Awards.

    Homepage

    My research focuses on the collective system of thinking and knowing, ranging from the distribution of attention and intuition, the origin of ideas and shared habits of reasoning to processes of agreement (and dispute), accumulation of certainty (and doubt), and the texture—novelty, ambiguity, topology—of understanding. I am especially interested in innovation—how new ideas and practices emerge—and the role that social and technical institutions (e.g., the Internet, markets, collaborations) play in collective cognition and discovery. Much of my work has focused on areas of modern science and technology, but I am also interested in other domains of knowledge—news, law, religion, gossip, hunches, machine and historical modes of thinking and knowing. I support the creation of novel observatories for human understanding and action through crowdsourcing, information extraction from text and images, and the use of distributed sensors (e.g., RFID tags, cell phones). I use machine learning, generative modeling, and social and semantic network representations to explore knowledge processes, scale up interpretive and field methods, and create alternatives to current discovery regimes.

    My research has been supported by the National Science Foundation, the National Institutes of Health, the Air Force Office of Scientific Research, and many philanthropic sources, and has been published in Nature, Science, Proceedings of the National Academy of Sciences, American Journal of Sociology, American Sociological Review, Social Studies of Science, Research Policy, Critical Theory, Administrative Science Quarterly, and other outlets. My work has been featured in the Economist, Atlantic Monthly, Wired, NPR, BBC, El País, CNN, Le Monde, and many other outlets.

    At Chicago, I am Director of Knowledge Lab, which has collaborative, granting, and employment opportunities, as well as ongoing seminars. I also founded and now direct the Computational Social Science program at Chicago, and sponsor an associated Computational Social Science workshop. I teach courses in augmented intelligence, the history of modern science, science studies, computational content analysis, and Internet and society. Before Chicago, I received my doctorate in sociology from Stanford University, served as a research associate in the Negotiation, Organizations, and Markets group at Harvard Business School, started a private high school focused on project-based arts education, and completed a B.A. in Anthropology at Brigham Young University.

    Homepage

    Eamon Duede is a joint PhD Candidate in the departments of Philosophy and the Committee on Conceptual and Historical Studies of Science, and was formerly the Executive Director of the Knowledge Lab. His work is broadly at the intersection of the philosophy of science and computational science of science. In the philosophy of science, he focuses on models, simulations, and artificial intelligence / machine learning in science. In computational science of science, he uses large scale, computational analysis alongside targeted intelligent surveying and field experiments to understand how institutions and communities produce knowledge.

    Panel: Postdoctoral Research in Data Science (1/12)

    Tarun Mangla will join CDAC as a postdoctoral scholar in summer 2020, and is currently a PhD student in the School of Computer Science at the Georgia Institute of Technology, co-advised by Mostafa Ammar and Ellen Zegura. His research interests span video streaming, network measurements, and cellular networks. He completed his bachelor's in Computer Science and Engineering at the Indian Institute of Technology, Delhi (2014) and an MS in Computer Science at Georgia Tech (2018). He is a recipient of the Best Paper Award at IFIP TMA 2018.

    Anna Woodard is a postdoctoral scholar in the Department of Computer Science at the University of Chicago, where she is part of Globus Labs.

    Qinyun Lin is a postdoctoral fellow at the Center for Spatial Data Science. Her research interests include sensitivity analysis, causal inference, mediation analysis, social network analysis, and multi-level models. Her dissertation proposes sensitivity analysis techniques for the presence of spillover effects and heterogeneous treatment effects in multi-site randomized controlled trials. Her dissertation work also examines unobserved mediators as post-treatment confounders in causal mediation analysis. Her current research applies a spatial perspective to access to medications for opioid use disorder and how that access affects opioid-related deaths and HCV infections.

    I am a Postdoctoral Researcher at Chicago Booth, working on questions of social perception (with Alex Todorov). My primary research program explores the default assumptions wired into the mind, especially in the context of perception. It turns out that what we see is the product of unconscious inference, in which we take into account not only the exact nature of the light entering our eyes, but also a set of assumptions about the source that most likely generated or reflected that light. It is now becoming possible to reveal the nature of these assumptions through various techniques. One of my favorite techniques is the method of serial reproduction — essentially the children’s game of ‘Broken Telephone’ — which I use to explore our default assumptions across several visual contexts, ranging from faces to intuitive physics.

    Most recently, I’ve also been working on building new face perception models, such that we can generate hyper-realistic faces, and manipulate both synthetic and real faces along social traits of psychological interest.

    Prior to joining Booth, I started out my postdoc at Princeton (still with Alex). Before that, I did my graduate work at Yale with Brian Scholl at the Perception & Cognition lab. I also spent time as a research assistant working with Won Mok Shim in the Visual Perception & Cognition Lab at Dartmouth College, which is also my undergraduate alma mater. I majored in Cognitive Science, Japanese Studies, and “A Cappella Studies”. In my spare time, I also fight fake news.

    Jamie Saxon will join CDAC as a postdoctoral scholar in summer 2020, and was previously a postdoctoral fellow with the Harris School of Public Policy and the Center for Spatial Data Science of the University of Chicago.

    He uses large data sources to measure the availability and use of civic and social resources in American cities. He is particularly interested in mobility among neighborhoods and the consequences of this mobility. He has also studied how gerrymandering affects representation, and developed powerful automated districting software.

    He is committed to developing resources for computational social science research, and has taught programming and statistics for masters’ students in public policy.

    He was trained as a particle physicist and was previously an Enrico Fermi Fellow on the ATLAS Experiment on CERN’s Large Hadron Collider at the Enrico Fermi Institute. He worked for many years on electronics and firmware for measuring and reconstructing particle trajectories. As a graduate student at the University of Pennsylvania, he made noteworthy contributions to the discovery and first measurements of the Higgs Boson in the two-photon channel.


    Panel: Careers in Data Science (1/12)

    Heather Zheng is the Neubauer Professor of Computer Science at the University of Chicago. She received her PhD in Electrical and Computer Engineering from the University of Maryland, College Park in 1999. Prior to joining the University of Chicago in 2017, she spent 6 years in industry labs (Bell Labs, NJ and Microsoft Research Asia) and 12 years at the University of California, Santa Barbara. At UChicago, she co-directs the SAND Lab (Systems, Algorithms, Networking and Data) together with Prof. Ben Y. Zhao.

    Homepage

    Chenhao Tan is an assistant professor at the Department of Computer Science and the Department of Information Science (by courtesy) at University of Colorado Boulder. His main research interests include language and social dynamics, human-centered machine learning, and multi-community engagement. He is also broadly interested in computational social science, natural language processing, and artificial intelligence.

    Homepage

    Maryellen L. Giger, Ph.D. is the A.N. Pritzker Professor of Radiology, Committee on Medical Physics, and the College at the University of Chicago. She is also the Vice-Chair of Radiology (Basic Science Research), the immediate past Director of the CAMPEP-accredited Graduate Programs in Medical Physics, and Chair of the Committee on Medical Physics at the University. For over 30 years, she has conducted research on computer-aided diagnosis, including computer vision, machine learning, and deep learning, in the areas of breast cancer, lung cancer, prostate cancer, lupus, and bone diseases.

    Over her career, she has served on various NIH, DOD, and other funding agencies’ study sections, and is now a member of the NIBIB Advisory Council of NIH. She is a former president of the American Association of Physicists in Medicine and a former president of the SPIE (the International Society of Optics and Photonics) and was the inaugural Editor-in-Chief of the SPIE Journal of Medical Imaging. She is a member of the National Academy of Engineering (NAE) and was awarded the William D. Coolidge Gold Medal from the American Association of Physicists in Medicine, the highest award given by the AAPM. She is a Fellow of AAPM, AIMBE, SPIE, SBMR, and IEEE, a recipient of the EMBS Academic Career Achievement Award, and is a current Hagler Institute Fellow at Texas A&M University. In 2013, Giger was named by the International Congress on Medical Physics (ICMP) as one of the 50 medical physicists with the most impact on the field in the last 50 years. In 2018, she received the iBIO iCON Innovator award.

    She has more than 200 peer-reviewed publications (over 300 publications in total), holds more than 30 patents, and has mentored over 100 graduate students, residents, medical students, and undergraduate students. Her research in computational image-based analyses of breast cancer for risk assessment, diagnosis, prognosis, and response to therapy has yielded various translated components, and she is now using these image-based phenotypes, i.e., these “virtual biopsies,” in imaging genomics association studies for discovery.

    She is a cofounder, equity holder, and scientific advisor of Quantitative Insights, Inc., which started through the 2009-2010 New Venture Challenge at the University of Chicago. QI produces QuantX, the first FDA-cleared, machine-learning-driven system to aid in cancer diagnosis (CADx). In 2019, QuantX was named one of TIME magazine’s inventions of the year.

    Homepage

    Bryon Aragam is an Assistant Professor and Topel Faculty Scholar in the Booth School of Business at the University of Chicago. He studies high-dimensional statistics, machine learning, and optimization. His research focuses on mathematical aspects of data science and statistical machine learning in nontraditional settings. Some of his recent projects include problems in graphical modeling, nonparametric statistics, personalization, nonconvex optimization, and high-dimensional inference. He is also involved with developing open-source software and solving problems in interpretability, ethics, and fairness in artificial intelligence. His work has been published in top statistics and machine learning venues such as the Annals of Statistics, Neural Information Processing Systems, the International Conference on Machine Learning, and the Journal of Statistical Software.

    Prior to joining the University of Chicago, he was a project scientist and postdoctoral researcher in the Machine Learning Department at Carnegie Mellon University. He completed his PhD in Statistics and a Masters in Applied Mathematics at UCLA, where he was an NSF graduate research fellow. Bryon has also served as a data science consultant for technology and marketing firms, where he has worked on problems in survey design and methodology, ranking, customer retention, and logistics.

    Homepage

    Marynia Kolak, MS, MFA, PhD, is a health geographer using open science tools and an exploratory data analytic approach to investigate issues of equity across space and time. Her research centers on how “place” impacts health outcomes in different ways, for different people, from opioid risk environments to chronic disease clusters. She focuses on quantifying and distilling the structural determinants of health across different environments, tying political ecology models of public health with geocomputational methods and quasi-experimental policy evaluation techniques. She received the 2017 Concordium Innovation Award at AcademyHealth for her open-source visualization of Chicago determinants of health, and the “Highest Impact” award in the Prevention Category at the American College of Cardiology 2019 conference for her work connecting chronic disease rates with social determinants of health. She serves as the Co-I and spatial analytic lead on the ETHIC project investigating the opioid epidemic in Illinois. She is the Assistant Director of Health Informatics and Lecturer in GIScience at the Center for Spatial Data Science, University of Chicago, and serves as a Public Service Intern at the Chicago Department of Public Health. Marynia additionally serves as a Health and Medical Specialty Group (AAG) board member and chair of the Chicago Public Health GIS Network. She received her Ph.D. in Geography at ASU, an M.F.A. in Writing from Roosevelt University, an M.S. in GIS from Johns Hopkins University, and a B.S. in Geology from the University of Illinois at Urbana-Champaign.

    Eric Jonas is a new professor in the Department of Computer Science at the University of Chicago. His research interests include biological signal acquisition, inverse problems, machine learning, heliophysics, neuroscience, and other exciting ways of exploiting scalable computation to understand the world. Previously he was at the Berkeley Center for Computational Imaging and RISELab at UC Berkeley EECS, working with Ben Recht.

    Homepage

  • Committee

    Rising Stars Committee

    Bryon Aragam is an Assistant Professor and Topel Faculty Scholar in the Booth School of Business at the University of Chicago. He studies high-dimensional statistics, machine learning, and optimization. His research focuses on mathematical aspects of data science and statistical machine learning in nontraditional settings. Some of his recent projects include problems in graphical modeling, nonparametric statistics, personalization, nonconvex optimization, and high-dimensional inference. He is also involved with developing open-source software and solving problems in interpretability, ethics, and fairness in artificial intelligence. His work has been published in top statistics and machine learning venues such as the Annals of Statistics, Neural Information Processing Systems, the International Conference on Machine Learning, and the Journal of Statistical Software.

    Prior to joining the University of Chicago, he was a project scientist and postdoctoral researcher in the Machine Learning Department at Carnegie Mellon University. He completed his PhD in Statistics and a Masters in Applied Mathematics at UCLA, where he was an NSF graduate research fellow. Bryon has also served as a data science consultant for technology and marketing firms, where he has worked on problems in survey design and methodology, ranking, customer retention, and logistics.

    Homepage.

    Raul Castro Fernandez is an Assistant Professor of Computer Science at the University of Chicago. In his research he builds systems for discovering, preparing, and processing data. The goal of his research is to understand and exploit the value of data. He often uses techniques from data management, statistics, and machine learning. His main effort these days is on building platforms to support markets of data. This is part of a larger research effort on understanding the Economics of Data. He’s part of ChiData, the data systems research group at The University of Chicago.

    Homepage.

    Yuxin Chen is an assistant professor in the Department of Computer Science at the University of Chicago. Previously, he was a postdoctoral scholar in Computing and Mathematical Sciences at Caltech, hosted by Prof. Yisong Yue. He received his Ph.D. degree in Computer Science from ETH Zurich under the supervision of Prof. Andreas Krause. He is a recipient of the PIMCO Postdoctoral Fellowship in Computing and Mathematical Sciences, a Swiss National Science Foundation Early Postdoc.Mobility fellowship, and a Google European Doctoral Fellowship in Interactive Machine Learning.

    His research interests lie broadly in probabilistic reasoning and machine learning. He is currently working on developing interactive machine learning systems that involve active learning, sequential decision making, interpretable models, and machine teaching. More information is available on his Google Scholar profile.

    Homepage.

    Marshini Chetty is an assistant professor in the Department of Computer Science at the University of Chicago, where she co-directs the Amyoli Internet Research Lab or AIR lab. She specializes in human-computer interaction, usable privacy and security, and ubiquitous computing. Marshini designs, implements, and evaluates technologies to help users manage different aspects of Internet use from privacy and security to performance, and costs. She often works in resource-constrained settings and uses her work to help inform Internet policy. She has a Ph.D. in Human-Centered Computing from Georgia Institute of Technology, USA and a Masters and Bachelors in Computer Science from the University of Cape Town, South Africa. In her former lives, Marshini was on the faculty in the Computer Science Department at Princeton University and the College of Information Studies at the University of Maryland, College Park. Her work has won best paper awards at SOUPS, CHI, and CSCW and has been funded by the National Science Foundation, the National Security Agency, Intel, Microsoft, Facebook, and multiple Google Faculty Research Awards.

    Homepage.

    Nick Feamster is the Neubauer Professor in the Department of Computer Science and the College. He researches computer networking and networked systems, with a particular interest in Internet censorship, privacy, and the Internet of Things. His work on experimental networked systems and security aims to make networks easier to manage, more secure, and more available.

    Homepage

    I lead an interdisciplinary computational and theoretical research group working on materials self-assembly, biomolecular simulation, viral dynamics, and vaccine design. My doctoral training provided me with expertise in molecular simulation, statistical mechanics, and machine learning, in which I developed new nonlinear machine learning approaches to study the conformations and dynamics of proteins, polymers, and confined water. During my post-doctoral fellowship, I acquired knowledge and skills in immunology and viral dynamics, and developed new computational tools for structure-free prediction of antibody binding sites and for the computational design of HIV vaccines using statistical mechanical principles.

    Since establishing my independent research program in 2012, I have combined this expertise to build a dynamic research program in computational materials science and computational virology, for which I have attracted over $2.9M in federal research funding, established a strong publication record (60+ papers) in leading journals, and been recognized with a number of national awards, including a 2018 Royal Society of Chemistry Molecular Systems Design and Engineering Emerging Investigator Award, a 2017 Dean’s Award for Excellence in Research, a 2016 AIChE CoMSEF Young Investigator Award, a 2015 ACS Outstanding Junior Faculty Award, a 2014 ACS Petroleum Research Fund Doctoral New Investigator Award, and a 2013 NSF CAREER Award; I was also named the 2013 Institution of Chemical Engineers North America “Young Chemical Engineer of the Year”. I am engaged and active within my professional organization, serving on the AIChE Area 1a Programming Committee and as CoMSEF Liaison Director, and organizing multiple scientific sessions at our national meetings. In addition to independent theoretical work, my research interests lead naturally to close collaboration with experimentalists and clinicians, teaching me the power of mutually reinforcing theoretical and experimental work and the importance of effective communication, planning, budgeting, teamwork, and leadership.

    Homepage

    Maryellen L. Giger, Ph.D. is the A.N. Pritzker Professor of Radiology, Committee on Medical Physics, and the College at the University of Chicago. She is also the Vice-Chair of Radiology (Basic Science Research) and the immediate past Director of the CAMPEP-accredited Graduate Programs in Medical Physics and Chair of the Committee on Medical Physics at the University. For over 30 years, she has conducted research on computer-aided diagnosis, including computer vision, machine learning, and deep learning, in the areas of breast cancer, lung cancer, prostate cancer, lupus, and bone diseases.

    Over her career, she has served on various NIH, DOD, and other funding agencies’ study sections, and is now a member of the NIBIB Advisory Council of NIH. She is a former president of the American Association of Physicists in Medicine and a former president of the SPIE (the International Society of Optics and Photonics) and was the inaugural Editor-in-Chief of the SPIE Journal of Medical Imaging. She is a member of the National Academy of Engineering (NAE) and was awarded the William D. Coolidge Gold Medal from the American Association of Physicists in Medicine, the highest award given by the AAPM. She is a Fellow of AAPM, AIMBE, SPIE, SBMR, and IEEE, a recipient of the EMBS Academic Career Achievement Award, and is a current Hagler Institute Fellow at Texas A&M University. In 2013, Giger was named by the International Congress on Medical Physics (ICMP) as one of the 50 medical physicists with the most impact on the field in the last 50 years. In 2018, she received the iBIO iCON Innovator award.

    She has more than 200 peer-reviewed publications (over 300 total publications), holds more than 30 patents, and has mentored over 100 graduate students, residents, medical students, and undergraduate students. Her research in computational image-based analyses of breast cancer for risk assessment, diagnosis, prognosis, and response to therapy has yielded various translated components, and she is now using these image-based phenotypes, i.e., “virtual biopsies,” in imaging genomics association studies for discovery.

    She is a cofounder, equity holder, and scientific advisor of Quantitative Insights, Inc., which started through the 2009-2010 New Venture Challenge at the University of Chicago. QI produces QuantX, the first FDA-cleared, machine-learning-driven system to aid in cancer diagnosis (CADx). In 2019, QuantX was named one of TIME magazine’s inventions of the year.

    Homepage.

    Eric Jonas is a new professor in the Department of Computer Science at the University of Chicago. His research interests include biological signal acquisition, inverse problems, machine learning, heliophysics, neuroscience, and other exciting ways of exploiting scalable computation to understand the world. Previously he was at the Berkeley Center for Computational Imaging and RISELab at UC Berkeley EECS working with Ben Recht.

    Homepage.

    Sanjay Krishnan is an Assistant Professor of Computer Science. His research group studies the theory and practice of building decision systems that are robust to corrupted, missing, or otherwise uncertain data. His research brings together ideas from statistics/machine learning and database systems. His research group is currently studying systems that can analyze large amounts of video, certifiable accuracy guarantees in partially complete databases, and theoretical lower-bounds for lossy compression in relational databases.

    Homepage.

    My main research interests are in speech and language processing, as well as related aspects of machine learning.

    I am an Associate Professor at TTI-Chicago, a philanthropically endowed academic computer science institute located on the University of Chicago campus. We are recruiting students to our PhD program and visiting student program, as well as additional faculty, including in speech and language-related areas (more on Speech and Language at TTIC).

    I completed my PhD in 2005 at MIT in the Spoken Language Systems group of the Computer Science and Artificial Intelligence Laboratory. In 2005-2007 I was a post-doctoral lecturer in the MIT EECS department. In Feb.-Aug. 2008 I was a Research Assistant Professor at TTI-Chicago.

    Homepage

    David Miller’s research focuses on answering open questions about the fundamental structure of matter. By studying the quarks and gluons (the particles that comprise everyday protons and neutrons) produced in the energetic collisions of protons at the Large Hadron Collider (LHC) at CERN in Geneva, Switzerland, Miller conducts measurements using the ATLAS Detector that seek out the existence of never-before-seen particles and characterize the particles and forces we know of with greater precision. Miller’s work on the properties and measurements of the experimental signatures of these quarks and gluons, or “jets,” is an integral piece of the puzzle used in the recent discovery of the Higgs boson, in searches for new massive particles that decay into boosted top quarks, and in the hints that the elusive quark-gluon plasma may have finally been observed in collisions of lead ions.

    Besides studying these phenomena, Miller has worked extensively on the construction and operation of the ATLAS detector, including the calorimeter and tracking systems that allow for these detailed measurements. Upgrades to these systems, involving colleagues at Argonne National Laboratory, CERN, and elsewhere, present an enormous challenge and will require a significant amount of research over the next several years. Miller is also working with state-of-the-art high-speed electronics for quickly deciphering the data collected by the ATLAS detector.

    Miller received his PhD from Stanford University in 2011 and his BA in Physics from the University of Chicago in 2005. He was a McCormick Fellow in the Enrico Fermi Institute from 2011-2013.

    Homepage.

    Dan Nicolae obtained his Ph.D. in statistics from The University of Chicago and has been a faculty member at the same institution since 1999, with appointments in Statistics (since 1999) and Medicine (since 2006). His research focuses on developing statistical and computational methods for understanding human genetic variation and its influence on the risk for complex traits, with an emphasis on asthma-related phenotypes. The current focus of his statistical genetics research is on data integration and system-level approaches using large datasets that include clinical and environmental data as well as various genetics/genomics data types: DNA variation, gene expression (RNA-seq), methylation, and microbiome.

    Homepage

    Chenhao Tan is an assistant professor in the Department of Computer Science and the Department of Information Science (by courtesy) at the University of Colorado Boulder. His main research interests include language and social dynamics, human-centered machine learning, and multi-community engagement. He is also broadly interested in computational social science, natural language processing, and artificial intelligence.

    Homepage.

    Rebecca Willett is a Professor of Statistics and Computer Science at the University of Chicago. She completed her PhD in Electrical and Computer Engineering at Rice University in 2005 and was an Assistant, then tenured Associate, Professor of Electrical and Computer Engineering at Duke University from 2005 to 2013. She was an Associate Professor of Electrical and Computer Engineering, Harvey D. Spangler Faculty Scholar, and Fellow of the Wisconsin Institutes for Discovery at the University of Wisconsin-Madison from 2013 to 2018. Prof. Willett received the National Science Foundation CAREER Award in 2007, was a member of the DARPA Computer Science Study Group from 2007 to 2011, and received an Air Force Office of Scientific Research Young Investigator Program award in 2010. Prof. Willett has also held visiting researcher positions at the Institute for Pure and Applied Mathematics at UCLA in 2004, the University of Wisconsin-Madison from 2003 to 2005, the French National Institute for Research in Computer Science and Control (INRIA) in 2003, and the Applied Science Research and Development Laboratory at GE Medical Systems (now GE Healthcare) in 2002. Her research interests include network and imaging science with applications in medical imaging, wireless sensor networks, astronomy, and social networks. She is also an instructor for FEMMES (Females Excelling More in Math, Engineering, and Science) and a local exhibit leader for Sally Ride Festivals. She was a recipient of the National Science Foundation Graduate Research Fellowship, the Rice University Presidential Scholarship, the Society of Women Engineers Caterpillar Scholarship, and the Angier B. Duke Memorial Scholarship.

    Homepage

    Heather Zheng is the Neubauer Professor of Computer Science at the University of Chicago. She received her PhD in Electrical and Computer Engineering from the University of Maryland, College Park in 1999. Prior to joining the University of Chicago in 2017, she spent 6 years in industry labs (Bell Labs, NJ and Microsoft Research Asia) and 12 years at the University of California at Santa Barbara. At UChicago, she co-directs the SAND Lab (Systems, Algorithms, Networking and Data) together with Prof. Ben Y. Zhao.

    Homepage.

  • Application

    Applications for the 2021 Rising Stars in Data Science Workshop are now closed.

    Application Timeline

    • Student Application Deadline (UPDATED): November 30th, 2020, 11:59pm CT
    • Faculty Nomination Deadline: November 30th, 2020, 11:59pm CT
    • Notification Deadline (Accepted Speakers): Week of December 14th, 2020
    • Workshop: January 11-12th, 2021

    Application Requirements

    The application is available through InfoReady. If you have not previously used InfoReady, you will be required to create an account in order to submit your application.

    • Resume/CV
    • Biography (100 words)
    • Research talk title
    • Research talk abstract (250 words)
    • Research statement outlining research goals, potential projects of interest, and long-term career goals (2 pg, standard font at a size 11 or larger)
    • Letter of recommendation (1 pg maximum, standard font at a size 11 or larger, a recommendation request will be sent when you add your reference to the application system)
    • Short answer (1,000 characters max per question)
      • What long-term impact do you hope to have with your research on the field of data science?
      • What is your timeline for going on the academic or industry job market? 
      • The Center for Data and Computing (CDAC) at UChicago focuses on early-stage, cutting-edge data science research to advance the establishment of this emerging field. In your opinion, what areas of data science research are currently missing or nascent, but the most promising?
      • How do you hope your research will advance issues germane to data science ethics, such as biased datasets, privacy, and the ethical use of data?
      • Please list 1-2 members from the UChicago organizing committee that you would be interested in having a 1:1 discussion with at the workshop.

    Review Criteria

    Proposals will be reviewed by the Rising Stars in Data Science Committee using the following scoring rubric (0-3 points per criterion):

    • Research Potential: Overall potential for research excellence, as demonstrated by the research statement, research goals, and long-term career goals.
    • Academic Progress: Academic progress to date, as evidenced by publications and endorsements from their faculty advisor or nominator.
    • Computational Background: Strong computational skills and expertise, ideally with coursework in computer science, statistics, data science, AI or a related field.
    • Data Science Commitment: Experience with interdisciplinary research that advances research innovation in the fields of data science or artificial intelligence.

    Due to the volume of applications we receive, we will be unable to provide reviewer feedback on applications that are not accepted.

  • FAQ
    • Rising Stars FAQ