CS 294-288: Data-Centric LLMs

Fall 2026

Instructor: Sewon Min
Class hours: TuTh 14:00-15:30 (class starts at 14:10 on Berkeley time)
Class location: Gateway B1023
Office hours: By appointment
Contact: sewonm@berkeley.edu (Please include “294-288” in the email subject)

Important! To enroll, students should be very familiar with the CS 288 material. All interested students (PhD, master's, and undergraduate) will be required to complete a form and pass a qualification quiz assessing CS 288 knowledge. Link TBA.

Important! We will use Slack for communication (no Ed). An invitation will be sent after the first class.

Overview: Advances in large language models (LLMs) have been driven by the increasing availability of large, diverse datasets. But where do these datasets come from, how are they used, and how can we leverage them more effectively? This course explores these questions, examining what data we use, how and why it works, and the challenges it introduces in LLM development.

The course is primarily designed for PhD students and centers on paper readings, discussions, and an open-ended project. Students are expected to have a strong background in ML/NLP/LLMs, be familiar with the CS 288 material, and be able to engage with research papers independently.

Class Schedule

TBA, but feel free to check last year’s schedule for reference.

