About me

I'm an Assistant Professor of AI and Decision-making in the MIT EECS department. I work to understand uncertainty and reliable decision-making with data. In particular, I develop tools for statistical inference with AI models, data impacted by strategic behavior, and settings with distribution shift. In addition, I work on applications in the life sciences and sustainability.

Previously, I was a postdoctoral researcher with Michael I. Jordan in the UC Berkeley Statistics and EECS departments. I completed my Ph.D. in the Stanford Department of Statistics advised by Emmanuel Candès, where I was awarded the Theodore W. Anderson Theory of Statistics Dissertation Award. Before my Ph.D., I studied statistics and mathematics at Harvard University, and spent a year teaching mathematics at NYU Shanghai. Outside research, I enjoy triathlons, sailing, hiking, and reading speculative fiction novels.

Research themes

I believe that conceptual, algorithmic, and mathematical advances enable us to use data and AI models to better understand complex patterns in the physical and social world and to build reliable automated systems. To this end, I focus on developing statistical principles and formal frameworks to understand challenging types of data that are increasingly important. In particular, I work on

  • Statistical inference with AI systems. AI models based on deep neural networks are increasingly used in real-world systems. Their use is motivated by the fact that they achieve the best performance on high-dimensional data, such as image and natural language data. However, the standard statistical toolbox does not apply here; users seeking assurances about the reliability of these models, such as confidence intervals on predictions or bounds on the false discovery rate across multiple decisions, are left with little recourse in the existing literature. I seek to build out a rich statistical toolbox for AI models, so that researchers can use these powerful systems while remaining on solid statistical ground. My work in this theme builds on core statistical techniques such as resampling methods, multiple hypothesis testing, and empirical process theory. [e.g., 1, 2, 3]

  • Data impacted by strategic behavior and information asymmetry. Data emerging from systems with human decision-makers is increasingly important, and the possibility of strategic behavior raises new inferential challenges. For example, profit-sensitive pharmaceutical companies sponsor clinical trials, which are then analyzed according to some statistical protocol, and are heavily rewarded for drugs that are approved. Correctly analyzing data affected by strategic agents is critical, and I am developing methods for this task that draw on concepts from decision theory, game theory, and statistics. [e.g., 1, 2, 3]

  • Shifting distributions and feedback loops. More broadly, data are increasingly collected from dynamic environments with shifting distributions, and these shifts can be caused by changes made to the system or policy. I work to extend statistical methods to such non-I.I.D. settings. For example, consider protein design, where the analyst has access to a set of proteins and their associated fitness scores. The goal is to design a new protein that has higher fitness than those seen previously. The analyst might fit a model predicting fitness from protein structure, and then choose a good candidate protein to synthesize, measuring its fitness in a wet-lab experiment. This process is repeated several times, so there is a feedback loop: the model the analyst fits affects the subsequent data collection. Such non-I.I.D. settings with shifting distributions are increasingly relevant to modern data analysis, and it is essential to create techniques to address them. [e.g., 1, 2, 3]

I'm especially interested in applications in the life sciences and sustainability.

News

  • I have openings for PhD students to work with me starting in the 2023-2024 academic year. If you're at MIT, send me an email and we can find a time to talk about research. If you're not yet at MIT, consider applying to the EECS PhD program and mention me in your application.

Select recent papers

“Incentive-Theoretic Bayesian Inference for Collaborative Science”

S. Bates, M. I. Jordan, M. Sklar, and J. A. Soloff.
arXiv preprint, 2023.
[arXiv] [bibtex]

“Prediction-Powered Inference”

A. N. Angelopoulos, S. Bates, C. Fannjiang, M. I. Jordan, and T. Zrnic.
Science, 2023.
[arXiv] [free journal version] [journal] [code] [bibtex]

“Conformal Risk Control”

A. N. Angelopoulos, S. Bates, A. Fisch, L. Lei, and T. Schuster.
ICLR, 2024. (spotlight presentation)
[arXiv] [code] [bibtex]

“Testing for Outliers with Conformal p-values”

S. Bates, E. Candès, L. Lei, Y. Romano, and M. Sesia.
Annals of Statistics, 2023.
[arXiv] [journal] [code] [bibtex]

“Distribution-Free, Risk-Controlling Prediction Sets”

S. Bates, A. Angelopoulos, L. Lei, J. Malik, and M. I. Jordan.
Journal of the ACM, 2021.
[arXiv] [journal] [video] [blog] [code] [bibtex]

“Cross-validation: what does it estimate and how well does it do it?”

S. Bates, T. Hastie, and R. Tibshirani.
Journal of the American Statistical Association (JASA), 2023.
[arXiv] [journal] [code] [bibtex]

“Causal Inference in Genetic Trio Studies”

S. Bates, M. Sesia, C. Sabatti, and E. Candès.
Proceedings of the National Academy of Sciences of the USA (PNAS), 2020.
[arXiv] [journal] [video] [tutorials+code] [bibtex]
*Selected as a cover article and for invited commentary.