I'm interested in the Science of Generative Models (e.g., LLMs): I ask how these models work, when they work, and why.
I develop holistic, causal, and data-centric approaches to study generative models.
These days, I focus on the data on which such models are trained and draw connections between the data and model behavior.
I'm happy to talk about research in general, and my own work in particular.
If you have any questions about one of my papers, or my overall research, feel free to reach out!
I am on the academic job market!
News
Keynote talk at The First Workshop on Large Language Model Memorization (L2M2) @ ACL 2025
Fazl Barez, Tung-Yu Wu, Iván Arcuschin, Michael Lan, Vincent Wang, Noah Siegel, Nicolas Collignon, Clement Neo, Isabelle Lee, Alasdair Paren, Adel Bibi, Robert Trager, Damiano Fornasiere, John Yan, Yanai Elazar, Yoshua Bengio
arxiv paper
Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback
*Lester James V. Miranda, *Yizhong Wang, Yanai Elazar, Sachin Kumar, Valentina Pyatkin, Faeze Brahman, Noah A. Smith, Hannaneh Hajishirzi, Pradeep Dasigi
ACL 2025 paper code resource blog
OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens
Jiacheng Liu, Taylor Blanton, Yanai Elazar, Sewon Min, YenSung Chen, Arnavi Chheda-Kothary, Huy Tran, Byron Bischoff, Eric Marsh, Michael Schmitz, Cassidy Trier, Aaron Sarnat, Jenna James, Jon Borchardt, Bailey Kuehl, Evie Cheng, Karen Farley, Sruthi Sreeram, Taira Anderson, David Albright, Carissa Schoenick, Luca Soldaini, Dirk Groeneveld, Rock Yuren Pang, Pang Wei Koh, Noah A. Smith, Sophie Lebrecht, Yejin Choi, Hannaneh Hajishirzi, Ali Farhadi, Jesse Dodge
ACL system demonstrations 2025 paper demo
Better Aligned with Survey Respondents or Training Data? Unveiling Political Leanings of LLMs on U.S. Supreme Court Cases
Shanshan Xu, T.Y.S.S Santosh, Yanai Elazar, Quirin Vogel, Barbara Plank, Matthias Grabmair
The First Workshop on Large Language Model Memorization @ ACL 2025 paper
On Linear Representations and Pretraining Data Frequency in Language Models
Jack Merullo, Noah A. Smith, *Sarah Wiegreffe, *Yanai Elazar
ICLR 2025 paper
Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data
*Xinyi Wang, *Antonis Antoniades, Yanai Elazar, Alfonso Amayuelas, Alon Albalak, Kexun Zhang, William Yang Wang
ICLR 2025 paper code
Calibrating Large Language Models with Sample Consistency
Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Valentina Pyatkin, Abhilasha Ravichander, Dustin Schwenk, Saurabh Shah, Will Smith, Emma Strubell, Nishant Subramani, Mitchell Wortsman, Pradeep Dasigi, Nathan Lambert, Kyle Richardson, Luke Zettlemoyer, Jesse Dodge, Kyle Lo, Luca Soldaini, Noah A. Smith, Hannaneh Hajishirzi
ACL 2024 🏆 Best Theme Paper paper long code resource models
Press: TechCrunch, Axios, Forbes, GeekWire, SD Times, VentureBeat, Fast Company
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Harsh Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Abhilasha Ravichander, Kyle Richardson, Zejiang Shen, Emma Strubell, Nishant Subramani, Oyvind Tafjord, Pete Walsh, Luke Zettlemoyer, Noah A. Smith, Hannaneh Hajishirzi, Iz Beltagy, Dirk Groeneveld, Jesse Dodge, Kyle Lo
ACL 2024 🏆 Best Resource Paper paper long code resource
The Bias Amplification Paradox in Text-to-Image Generation
First Align, then Predict: Understanding the Cross-Lingual Ability of Multilingual BERT
Benjamin Muller, Yanai Elazar, Benoît Sagot and Djamé Seddah
EACL 2021 paper short code
*Amnesic Probing: Behavioral Explanation with Amnesic Counterfactuals
Yanai Elazar, Shauli Ravfogel, Alon Jacovi, Yoav Goldberg
TACL 2021
(*) A previous version that appeared on arXiv was titled "When Bert Forgets How To POS: Amnesic Probing of Linguistic Properties and MLM Predictions"; we changed it to the current title to better reflect our contributions.
paper journal code slides video
2020
At Your Fingertips: Extracting Piano Fingering Instructions from Videos
Amit Moryossef, Yanai Elazar, Yoav Goldberg
arxiv paper code
It’s not Greek to mBERT: Inducing Word-Level Translations from Multilingual BERT
Hila Gonen, Shauli Ravfogel, Yanai Elazar, Yoav Goldberg
Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, at EMNLP 2020 paper long code poster
The Extraordinary Failure of Complement Coercion Crowdsourcing
Yanai Elazar, Victoria Basmov, Shauli Ravfogel, Yoav Goldberg, Reut Tsarfaty
Workshop on Insights from Negative Results in NLP, EMNLP 2020 paper short slides video
Do Language Embeddings Capture Scales?
Xikun Zhang, Deepak Ramachandran, Ian Tenney, Yanai Elazar, Dan Roth
Findings of EMNLP 2020 paper long code
Unsupervised Distillation of Syntactic Information from Contextualized Word Representations
*Shauli Ravfogel, *Yanai Elazar, Jacob Goldberger, Yoav Goldberg
Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, at EMNLP 2020 paper long code slides
Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection
Evaluating Models' Local Decision Boundaries via Contrast Sets
Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hanna Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, Ally Zhang, Ben Zhou
Findings of EMNLP 2020 paper long resource
oLMpics -- On what Language Model Pre-training Captures
Alon Talmor, Yanai Elazar, Yoav Goldberg, Jonathan Berant
TACL 2020 (presented at EMNLP 2020) paper journal code video
2019
Adversarial Removal of Demographic Attributes Revisited
Maria Barrett, Yova Kementchedjhieva, Yanai Elazar, Desmond Elliott, Anders Søgaard
EMNLP 2019 paper short
How Large Are Lions? Inducing Distributions over Quantitative Attributes