CS59200: AI/DC Networking
Instructor: Vamsi Addanki
Department: Computer Science, Purdue University
Semester: Fall 2025
Location: LWSN B134
Time: TR 3:00PM - 4:15PM
Credit hours: 3
Email: vaddank@purdue.edu
Target Audience
This course is particularly valuable for early-stage PhD students in computer science who are interested in pursuing research at the intersection of AI, networking, and computing systems.
Course Overview
Have you ever wondered what makes massive AI models possible?
How do thousands of GPUs communicate to train today’s largest models? Can AI itself help configure and optimize the networks that connect them?
From cutting-edge datacenter architectures to adaptive photonic interconnects that bend light, this course dives into the critical role networking plays in enabling AI at scale — and how AI, in turn, is transforming the way we build and manage networks.
In this course, we will explore the algorithms, architectures, and open research challenges at the frontier where AI and networking meet.
This seminar-style course covers a range of topics, including datacenter network topologies, collective communication algorithms (GPU-to-GPU communication), photonic interconnects, network congestion control and load balancing, AI-assisted algorithm design, and the use of AI in network management and optimization. The tentative weekly schedule is below. The optional readings are pointers into the surrounding literature; exploring them is strongly recommended but not required for the course.
Tentative Schedule
| Week | Date | Paper Title | Presenter | Optional Reading |
|------|------|-------------|-----------|------------------|
| | | **Warmup!** | | |
| 1 | Aug 26 | Introduction (Slides) | All | – |
| 1 | Aug 28 | How to read a paper [1] | All/Discussion | [2, 3] |
| | | **LLM Training Architectures (Hyperscaler Experience)** | | |
| 2 | Sep 2 | RDMA over Ethernet for Distributed Training at Meta Scale [4] | [Student] | [5, 6, 7] |
| 2 | Sep 4 | Alibaba HPN: A Data Center Network for LLM Training [8] | [Student] | [9, 10] |
| | | **Collective Communication I: Primitives & AllReduce** | | |
| 3 | Sep 9 | Optimization of Collective Communication in MPICH [11] | [Student] | [12] |
| 3 | Sep 11 | Swing: Short-cutting Rings for Higher Bandwidth AllReduce [13] | [Student] | [14] |
| | | **Collective Communication II: Synthesis** | | |
| 4 | Sep 16 | Synthesizing Optimal Collective Algorithms [15] | [Student] | [16] |
| 4 | Sep 18 | Collectives as Multi-Commodity Flow [17] | [Student] | [18, 19] |
| | | **Collective Communication III: Stragglers** | | |
| 5 | Sep 23 | Accelerating AllReduce with a Persistent Straggler [20] | [Student] | [21] |
| 5 | Sep 25 | OptiReduce: Tail-Optimal AllReduce [22] | [Student] | [23, 24] |
| | | *Assignment 1 Due* | | |
| | | **Photonic Interconnects I: Oblivious & Traffic-Aware** | | |
| 6 | Sep 30 | RotorNet [25] | [Student] | [26, 27, 28, 29] |
| 6 | Oct 2 | Scheduling in Hybrid Networks [30] | [Student] | [31, 32, 33, 34] |
| | | **Photonic Interconnects II: TPU Clusters** | | |
| 7 | Oct 7 | TPU v4 Supercomputer [35] | [Student] | [36] |
| 7 | Oct 9 | Resiliency at Scale: TPUv4 [37] | [Student] | [38] |
| | | **Photonic Interconnects III: Topologies for Collectives** | | |
| 8 | Oct 14 | SiP-ML [39] | [Student] | [40] |
| 8 | Oct 16 | TopoOpt [41] | [Student] | [42] |
| | | *Assignment 2 Due* | | |
| | | **Photonic Interconnects IV: Chip-to-Chip** | | |
| 9 | Oct 21 | Server-Scale Photonic Connectivity [43] | [Student] | [44, 45] |
| 9 | Oct 23 | Midterm Examination | – | – |
| | | **AI for Networks I: LLMs & Fun Stuff** | | |
| 10 | Oct 28 | Enhancing Network Management Using Code Generated by LLMs [46] | [Student] | [47] |
| 10 | Oct 30 | What do LLMs need to Synthesize Correct Router Configurations? [48] | [Student] | [49, 50] |
| | | **AI for Networks II: Performance Guarantees** | | |
| 11 | Nov 4 | Credence: Augmenting Switch Buffer Sharing with ML Predictions [51] | [Student] | [52] |
| 11 | Nov 6 | Towards Integrating Formal Methods into ML-Based Systems [53] | [Student] | [54] |
| | | **AI for Networks III: Wide-Area Networks** | | |
| 12 | Nov 11 | DOTE: Rethinking (Predictive) WAN Traffic Engineering [55] | [Student] | [56] |
| 12 | Nov 13 | Transferable Neural WAN TE for Changing Topologies [57] | [Student] | [58] |
| | | **AI for Networks IV: Congestion Control** | | |
| 13 | Nov 18 | TCP ex Machina [59] | [Student] | [60] |
| 13 | Nov 20 | PCC: Re-architecting Congestion Control [61] | [Student] | [62] |
| | | *Assignment 3 Due* | | |
| | | **Thanksgiving Break** | | |
| 14 | Nov 25 | – | – | – |
| 14 | Nov 27 | – | – | – |
| | | **Projects & Feedback** | | |
| 15 | Dec 2 | Project submissions due next week | – | – |
| 15 | Dec 4 | Project submissions due next week | – | – |
| | | **Finals Week** | | |
| 16 | Dec 9 | Final Presentations (All) | – | – |
| 16 | Dec 11 | Final Presentations (All) | – | – |
Assignments, Midterm, and Final Project
The course is structured around student-led presentations and discussions held during weekly sessions, with the instructor providing guidance and facilitating exploration of the material. Each student will present assigned research papers to the class and participate in discussions to enhance collective understanding. Course evaluation is based on three assignments, one midterm exam, and a final research project.
- Assignment 1. Each student will be assigned an AllReduce algorithm (or a synthesized variant) to implement in the Astra-Sim simulator. The simulation should use a ring topology of 16 nodes, each connected by 400 Gbps links with a 500 ns propagation delay. The goal is to evaluate the algorithm's completion time and compare it against the baseline Ring AllReduce across a range of message sizes (see the cost-model sketch after this list).
- Assignment 2. Building on the first assignment, extend the implementation to a reconfigurable ring topology where nodes are connected via a photonic switch. The objective is to optimize the circuit-switching schedule to minimize the AllReduce completion time (the sketch after this list includes one illustrative baseline). Students may submit either: (i) a proof showing the minimized completion time based on an optimized schedule, or (ii) simulation results using Astra-Sim, along with a clear description of the optimization method used.
- Assignment 3.
- Option 1: Implement the assigned AllReduce algorithm with the NVIDIA Collective Communication Library (NCCL), either by writing CUDA code directly or through PyTorch, and evaluate its performance on an 8-GPU cluster (access will be provided). The nccl-tests repository provides a good starting point for the implementation; a minimal PyTorch benchmark appears after this list.
- Option 2: Extend the HTTP/3 QUIC transport protocol with a learning-augmented congestion control algorithm (e.g., using Cubic or Reno as the base algorithm and leveraging ML predictions about network conditions) and implement it in aioquic. Test the final implementation by sending iperf traffic to different remote servers and compare its throughput, latency, and loss against the baseline QUIC implementation. Students may explore any learning algorithm of their choice, with emphasis on the techniques and methods discussed in the course schedule (a self-contained window-update sketch appears after this list).
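For Assignments 1 and 2, the standard alpha-beta cost model from the collective communication literature (e.g., Thakur et al. [11]) gives closed-form completion-time estimates that are useful as sanity checks against Astra-Sim output. The sketch below is back-of-the-envelope only: the recursive-halving schedule and the 1 µs reconfiguration delay `delta` are illustrative assumptions, not part of the assignment specification.

```python
import math

# alpha: per-step latency (here, the 500 ns propagation delay);
# beta:  seconds per byte on a 400 Gbps link.
# The model ignores congestion and compute overlap, so treat the
# output as a sanity check for simulation results, not ground truth.

def ring_allreduce_time(n, msg_bytes, alpha, beta):
    """Ring AllReduce: 2(n-1) steps, each carrying msg_bytes/n."""
    return 2 * (n - 1) * (alpha + (msg_bytes / n) * beta)

def reconfigurable_allreduce_time(n, msg_bytes, alpha, beta, delta):
    """One illustrative Assignment 2 baseline (an assumption, not the
    required answer): recursive halving/doubling, with the photonic
    switch re-wiring the ring before each step at a cost of delta."""
    steps = int(math.log2(n))
    # reduce-scatter halves the payload each step; all-gather mirrors it
    return 2 * sum(alpha + delta + (msg_bytes / 2 ** (k + 1)) * beta
                   for k in range(steps))

n = 16              # ring nodes (Assignment 1 setup)
alpha = 500e-9      # 500 ns propagation delay
beta = 8 / 400e9    # seconds per byte on a 400 Gbps link

for mb in (1, 16, 256):
    size = mb * 2 ** 20
    t_ring = ring_allreduce_time(n, size, alpha, beta)
    t_reconf = reconfigurable_allreduce_time(n, size, alpha, beta, delta=1e-6)
    print(f"{mb:4d} MiB: ring {t_ring * 1e3:8.3f} ms, "
          f"reconfigured {t_reconf * 1e3:8.3f} ms")
```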
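For Option 1, a minimal PyTorch benchmark in the spirit of nccl-tests might look as follows, launched with `torchrun --nproc_per_node=8 allreduce_bench.py`. The script name, message sizes, and iteration count are arbitrary choices, not requirements of the assignment.

```python
import os
import time
import torch
import torch.distributed as dist

def main():
    # one process per GPU; torchrun sets RANK/WORLD_SIZE/LOCAL_RANK
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    for numel in (1 << 20, 1 << 24, 1 << 26):  # 4 MB .. 256 MB of fp32
        x = torch.ones(numel, device="cuda")
        dist.all_reduce(x)                      # warm-up
        torch.cuda.synchronize()
        start = time.perf_counter()
        iters = 10
        for _ in range(iters):
            dist.all_reduce(x, op=dist.ReduceOp.SUM)
        torch.cuda.synchronize()
        elapsed = (time.perf_counter() - start) / iters
        if dist.get_rank() == 0:
            gbytes = numel * x.element_size() / 1e9
            print(f"{gbytes:7.3f} GB: {elapsed * 1e3:8.3f} ms/iter")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```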
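For Option 2, the sketch below illustrates the learning-augmented idea in isolation: a Reno-style window update blended with an ML prediction of the bandwidth-delay product (BDP). It deliberately avoids aioquic's internal congestion-control interface, which varies across releases (inspect `aioquic.quic.congestion` in your installed version before integrating). The `trust` parameter, the BDP predictor, and the clamping rule are assumptions for illustration, echoing the consistency-robustness trade-off in learning-augmented algorithms such as Credence [51].

```python
MSS = 1280  # bytes; a typical QUIC max datagram payload

def reno_on_ack(cwnd, ssthresh, acked_bytes):
    """Baseline Reno: slow start below ssthresh, then linear growth."""
    if cwnd < ssthresh:
        return cwnd + acked_bytes                  # slow start
    return cwnd + MSS * acked_bytes // cwnd        # congestion avoidance

def augmented_on_ack(cwnd, ssthresh, acked_bytes, predicted_bdp, trust=0.5):
    """Blend the Reno update with an ML prediction of the BDP.

    trust=0 falls back to pure Reno (robustness); trust=1 jumps straight
    to the predicted operating point (consistency). Clamping the blended
    window between the two keeps worst-case behavior bounded.
    """
    base = reno_on_ack(cwnd, ssthresh, acked_bytes)
    blended = (1 - trust) * base + trust * predicted_bdp
    return int(max(min(blended, max(base, predicted_bdp)), MSS))

# toy usage: a (hypothetical) predictor says the pipe holds ~100 packets
cwnd, ssthresh = 10 * MSS, 64 * MSS
for _ in range(20):
    cwnd = augmented_on_ack(cwnd, ssthresh, acked_bytes=10 * MSS,
                            predicted_bdp=100 * MSS, trust=0.5)
print(f"final cwnd: {cwnd / MSS:.1f} packets")
```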
Midterm: The format of the midterm will be announced during the semester and will focus on the core concepts underlying the algorithms and protocol designs covered in the weekly readings.
Final project: The course concludes with a final research project, to be submitted via an internal HotCRP portal. Evaluation of the project will primarily consider the originality of the algorithmic or systems design proposal, the depth of related work understanding, and the quality of the presentation.
Learning Objectives
- Develop critical thinking in networking.
- Lead and participate in academic discussions.
- Analyze and present research papers.
- Explore and propose innovative solutions.
- Implement/test GPU communication algorithms.
- Write research papers on AI/DC networking.
Evaluation
- Assignments: 25%
- Midterm exam: 25%
- Final project/paper: 50%
References
How to read a paper
S. Keshav.
SIGCOMM Comput. Commun. Rev.,
2007.
Abstract
BibTeX
Click
here to close the dropdown!
Researchers spend a great deal of time reading research papers. However, this skill is rarely taught, leading to much wasted effort. This article outlines a practical and efficient three-pass method for reading research papers. I also describe how to use this method to do a literature survey.
Click
here to close the dropdown!
@article{10.1145/1273445.1273458,
author = {Keshav, S.},
title = {How to read a paper},
year = {2007},
issue_date = {July 2007},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {37},
number = {3},
issn = {0146-4833},
url = {https://doi.org/10.1145/1273445.1273458},
doi = {10.1145/1273445.1273458},
journal = {SIGCOMM Comput. Commun. Rev.},
pages = {83–84},
numpages = {2},
keywords = {hints, paper, reading}
}
Writing reviews for systems conferences
Timothy Roscoe.
BibTeX
Click
here to close the dropdown!
@misc{roscoe2007writing,
title = {Writing reviews for systems conferences},
author = {Roscoe, Timothy},
year = {2007},
url = {https://people.inf.ethz.ch/troscoe/pubs/review-writing.pdf}
}
Writing Technical Articles
Henning Schulzrinne.
BibTeX
Click
here to close the dropdown!
@misc{henning,
title = {Writing Technical Articles},
author = {Schulzrinne, Henning},
url = {https://www.cs.columbia.edu/~hgs/etc/writing-style.html}
}
RDMA over Ethernet for Distributed Training at Meta Scale
Adithya Gangidi, Rui Miao, Shengbao Zheng, Sai Jayesh Bondu, Guilherme Goes, Hany Morsy, Rohit Puri, Mohammad Riftadi, Ashmitha Jeevaraj Shetty, Jingyi Yang, Shuqiang Zhang, Mikel Jimenez Fernandez, Shashidhar Gandham, and Hongyi Zeng.
Proceedings of the ACM SIGCOMM 2024 Conference,
Sydney, NSW, Australia,
2024.
Abstract
BibTeX
Click
here to close the dropdown!
The rapid growth in both computational density and scale in AI models in recent years motivates the construction of an efficient and reliable dedicated network infrastructure. This paper presents the design, implementation, and operation of Meta’s Remote Direct Memory Access over Converged Ethernet (RoCE) networks for distributed AI training.Our design principles involve a deep understanding of the workloads, and we translated these insights into the design of various network components: Network Topology - To support the rapid evolution of generations of AI hardware platforms, we separated GPU-based training into its own "backend" network. Routing - Training workloads inherently impose load imbalance and burstiness, so we deployed several iterations of routing schemes to achieve near-optimal traffic distribution. Transport - We outline how we initially attempted to use DCQCN for congestion management but then pivoted away from DCQCN to instead leverage the collective library itself to manage congestion. Operations - We share our experience operating large-scale AI networks, including toolings we developed and troubleshooting examples.
Click
here to close the dropdown!
@inproceedings{10.1145/3651890.3672233,
author = {Gangidi, Adithya and Miao, Rui and Zheng, Shengbao and Bondu, Sai Jayesh and Goes, Guilherme and Morsy, Hany and Puri, Rohit and Riftadi, Mohammad and Shetty, Ashmitha Jeevaraj and Yang, Jingyi and Zhang, Shuqiang and Fernandez, Mikel Jimenez and Gandham, Shashidhar and Zeng, Hongyi},
title = {RDMA over Ethernet for Distributed Training at Meta Scale},
year = {2024},
isbn = {9798400706141},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3651890.3672233},
doi = {10.1145/3651890.3672233},
booktitle = {Proceedings of the ACM SIGCOMM 2024 Conference},
pages = {57–70},
numpages = {14},
keywords = {RDMA, distributed training},
series = {ACM SIGCOMM '24}
}
RDMA over Commodity Ethernet at Scale
Chuanxiong Guo, Haitao Wu, Zhong Deng, Gaurav Soni, Jianxi Ye, Jitu Padhye, and Marina Lipshteyn.
Proceedings of the 2016 ACM SIGCOMM Conference,
Florianopolis, Brazil,
2016.
Abstract
BibTeX
Click
here to close the dropdown!
Over the past one and half years, we have been using RDMA over commodity Ethernet (RoCEv2) to support some of Microsoft’s highly-reliable, latency-sensitive services. This paper describes the challenges we encountered during the process and the solutions we devised to address them. In order to scale RoCEv2 beyond VLAN, we have designed a DSCP-based priority flow-control (PFC) mechanism to ensure large-scale deployment. We have addressed the safety challenges brought by PFC-induced deadlock (yes, it happened!), RDMA transport livelock, and the NIC PFC pause frame storm problem. We have also built the monitoring and management systems to make sure RDMA works as expected. Our experiences show that the safety and scalability issues of running RoCEv2 at scale can all be addressed, and RDMA can replace TCP for intra data center communications and achieve low latency, low CPU overhead, and high throughput.
Click
here to close the dropdown!
@inproceedings{10.1145/2934872.2934908,
author = {Guo, Chuanxiong and Wu, Haitao and Deng, Zhong and Soni, Gaurav and Ye, Jianxi and Padhye, Jitu and Lipshteyn, Marina},
title = {RDMA over Commodity Ethernet at Scale},
year = {2016},
isbn = {9781450341936},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/2934872.2934908},
doi = {10.1145/2934872.2934908},
booktitle = {Proceedings of the 2016 ACM SIGCOMM Conference},
pages = {202–215},
numpages = {14},
keywords = {RoCEv2, RDMA, PFC propagation, PFC, Deadlock},
series = {SIGCOMM '16}
}
ASTRA-SIM: Enabling SW/HW Co-Design Exploration for Distributed DL Training Platforms
Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, and Tushar Krishna.
2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS),
2020.
BibTeX
Click
here to close the dropdown!
@inproceedings{a9238637,
author = {Rashidi, Saeed and Sridharan, Srinivas and Srinivasan, Sudarshan and Krishna, Tushar},
booktitle = {2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)},
title = {ASTRA-SIM: Enabling SW/HW Co-Design Exploration for Distributed DL Training Platforms},
year = {2020},
volume = {},
number = {},
pages = {81-92},
keywords = {Training;Technological innovation;Navigation;Network topology;Software algorithms;Software;Scheduling;Distributed training;Collective communication;Training parallelism;High performance training systems},
doi = {10.1109/ISPASS48437.2020.00018}
}
ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale
William Won, Taekyung Heo, Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, and Tushar Krishna.
2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS),
2023.
BibTeX
Click
here to close the dropdown!
@inproceedings{a10158106,
author = {Won, William and Heo, Taekyung and Rashidi, Saeed and Sridharan, Srinivas and Srinivasan, Sudarshan and Krishna, Tushar},
booktitle = {2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)},
title = {ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale},
year = {2023},
volume = {},
number = {},
pages = {283-294},
keywords = {Training;Semiconductor device modeling;Analytical models;Network topology;Systems modeling;Throughput;Data models;Distributed training;High-performance training;Multi-dimensional network;Disaggregated memory system},
doi = {10.1109/ISPASS57527.2023.00035}
}
Alibaba HPN: A Data Center Network for Large Language Model Training
Kun Qian, Yongqing Xi, Jiamin Cao, Jiaqi Gao, Yichi Xu, Yu Guan, Binzhang Fu, Xuemei Shi, Fangbo Zhu, Rui Miao, Chao Wang, Peng Wang, Pengcheng Zhang, Xianlong Zeng, Eddie Ruan, Zhiping Yao, Ennan Zhai, and Dennis Cai.
Proceedings of the ACM SIGCOMM 2024 Conference,
Sydney, NSW, Australia,
2024.
Abstract
BibTeX
Click
here to close the dropdown!
This paper presents HPN, Alibaba Cloud’s data center network for large language model (LLM) training. Due to the differences between LLMs and general cloud computing (e.g., in terms of traffic patterns and fault tolerance), traditional data center networks are not well-suited for LLM training. LLM training produces a small number of periodic, bursty flows (e.g., 400Gbps) on each host. This characteristic of LLM training predisposes Equal-Cost Multi-Path (ECMP) to hash polarization, causing issues such as uneven traffic distribution. HPN introduces a 2-tier, dual-plane architecture capable of interconnecting 15K GPUs within one Pod, typically accommodated by the traditional 3-tier Clos architecture. Such a new architecture design not only avoids hash polarization but also greatly reduces the search space for path selection. Another challenge in LLM training is that its requirement for GPUs to complete iterations in synchronization makes it more sensitive to singlepoint failure (typically occurring on ToR). HPN proposes a new dual-ToR design to replace the single-ToR in traditional data center networks. HPN has been deployed in our production for more than eight months. We share our experience in designing, and building HPN, as well as the operational lessons of HPN in production.
Click
here to close the dropdown!
@inproceedings{10.1145/3651890.3672265,
author = {Qian, Kun and Xi, Yongqing and Cao, Jiamin and Gao, Jiaqi and Xu, Yichi and Guan, Yu and Fu, Binzhang and Shi, Xuemei and Zhu, Fangbo and Miao, Rui and Wang, Chao and Wang, Peng and Zhang, Pengcheng and Zeng, Xianlong and Ruan, Eddie and Yao, Zhiping and Zhai, Ennan and Cai, Dennis},
title = {Alibaba HPN: A Data Center Network for Large Language Model Training},
year = {2024},
isbn = {9798400706141},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3651890.3672265},
doi = {10.1145/3651890.3672265},
booktitle = {Proceedings of the ACM SIGCOMM 2024 Conference},
pages = {691–706},
numpages = {16},
keywords = {network architecture, AI infrastructure, large language model, model training, data center networks},
series = {ACM SIGCOMM '24}
}
SimAI: Unifying Architecture Design and Performance Tuning for Large-Scale Large Language Model Training with Scalability and Precision
Xizheng Wang, Qingxu Li, Yichi Xu, Gang Lu, Dan Li, Li Chen, Heyang Zhou, Linkang Zheng, Sen Zhang, Yikai Zhu, Yang Liu, Pengcheng Zhang, Kun Qian, Kunling He, Jiaqi Gao, Ennan Zhai, Dennis Cai, and Binzhang Fu.
22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25),
2025.
BibTeX
Click
here to close the dropdown!
@inproceedings{a305348,
author = {Wang, Xizheng and Li, Qingxu and Xu, Yichi and Lu, Gang and Li, Dan and Chen, Li and Zhou, Heyang and Zheng, Linkang and Zhang, Sen and Zhu, Yikai and Liu, Yang and Zhang, Pengcheng and Qian, Kun and He, Kunling and Gao, Jiaqi and Zhai, Ennan and Cai, Dennis and Fu, Binzhang},
title = {{SimAI}: Unifying Architecture Design and Performance Tuning for {Large-Scale} Large Language Model Training with Scalability and Precision},
booktitle = {22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25)},
year = {2025},
isbn = {978-1-939133-46-5},
address = {Philadelphia, PA},
pages = {541--558},
url = {https://www.usenix.org/conference/nsdi25/presentation/wang-xizheng-simai},
publisher = {USENIX Association}
}
I’ve Got 99 Problems But FLOPS Ain’t One
Alexandru M. Gherghescu, Vlad-Andrei Bădoiu, Alexandru Agache, Mihai-Valentin Dumitru, Iuliu Vasilescu, Radu Mantu, and Costin Raiciu.
Proceedings of the 23rd ACM Workshop on Hot Topics in Networks,
Irvine, CA, USA,
2024.
Abstract
BibTeX
Click
here to close the dropdown!
Hyperscalers dominate the landscape of large network deployments, yet they rarely share data or insights about the challenges they face. In light of this supremacy, what problems can we find to solve in this space? We take an unconventional approach to find relevant research directions, starting from public plans to build a $100 billion datacenter for machine learning applications [53]. Leveraging the language models scaling laws, we discover what workloads such a datacenter might carry and explore the challenges one may encounter in doing so, with a focus on networking research. We conclude that building the datacenter and training such models is technically possible, but this requires novel wide-area transports for inter-DC communication, a multipath transport and novel datacenter topologies for intra-datacenter communication, high speed scale-up networks and transports, outlining a rich research agenda for the networking community.
Click
here to close the dropdown!
@inproceedings{a10.1145/3696348.3696893,
author = {Gherghescu, Alexandru M. and B\u{a}doiu, Vlad-Andrei and Agache, Alexandru and Dumitru, Mihai-Valentin and Vasilescu, Iuliu and Mantu, Radu and Raiciu, Costin},
title = {I've Got 99 Problems But FLOPS Ain't One},
year = {2024},
isbn = {9798400712722},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3696348.3696893},
doi = {10.1145/3696348.3696893},
booktitle = {Proceedings of the 23rd ACM Workshop on Hot Topics in Networks},
pages = {195–204},
numpages = {10},
keywords = {Datacenter Networking, Large Language Models Training},
series = {HotNets '24}
}
Optimization of Collective Communication Operations in MPICH
Rajeev Thakur, Rolf Rabenseifner, and William Gropp.
The International Journal of High Performance Computing Applications,
2005.
BibTeX
Click
here to close the dropdown!
@article{doi:10.1177/1094342005051521,
author = {Thakur, Rajeev and Rabenseifner, Rolf and Gropp, William},
title = {Optimization of Collective Communication Operations in MPICH},
journal = {The International Journal of High Performance Computing Applications},
volume = {19},
number = {1},
pages = {49-66},
year = {2005},
doi = {10.1177/1094342005051521},
url = {https://doi.org/10.1177/1094342005051521},
eprint = {https://doi.org/10.1177/1094342005051521}
}
Collective communication: theory, practice, and experience
Ernie Chan, Marcel Heimlich, Avi Purkayastha, and Robert van de Geijn.
Concurrency and Computation: Practice and Experience,
2007.
Abstract
BibTeX
Click
here to close the dropdown!
Abstract We discuss the design and high-performance implementation of collective communications operations on distributed-memory computer architectures. Using a combination of known techniques (many of which were first proposed in the 1980s and early 1990s) along with careful exploitation of communication modes supported by MPI, we have developed implementations that have improved performance in most situations compared to those currently supported by public domain implementations of MPI such as MPICH. Performance results from a large Intel Xeon/Pentium 4 (R) processor cluster are included. Copyright © 2007 John Wiley & Sons, Ltd.
Click
here to close the dropdown!
@article{https://doi.org/10.1002/cpe.1206,
author = {Chan, Ernie and Heimlich, Marcel and Purkayastha, Avi and van de Geijn, Robert},
title = {Collective communication: theory, practice, and experience},
journal = {Concurrency and Computation: Practice and Experience},
volume = {19},
number = {13},
pages = {1749-1783},
keywords = {collective communication, distributed-memory architecture, clusters},
doi = {https://doi.org/10.1002/cpe.1206},
url = {https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.1206},
eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1002/cpe.1206},
year = {2007}
}
Swing: Short-cutting Rings for Higher Bandwidth Allreduce
Daniele De Sensi, Tommaso Bonato, David Saam, and Torsten Hoefler.
21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24),
2024.
BibTeX
Click
here to close the dropdown!
@inproceedings{a295653,
author = {Sensi, Daniele De and Bonato, Tommaso and Saam, David and Hoefler, Torsten},
title = {Swing: Short-cutting Rings for Higher Bandwidth Allreduce},
booktitle = {21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24)},
year = {2024},
isbn = {978-1-939133-39-7},
address = {Santa Clara, CA},
pages = {1445--1462},
url = {https://www.usenix.org/conference/nsdi24/presentation/de-sensi},
publisher = {USENIX Association}
}
A generalization of the allreduce operation
Dmitry Kolmakov and Xuecang Zhang.
arXiv preprint arXiv:2004.09362,
2020.
BibTeX
Click
here to close the dropdown!
@article{kolmakov2020generalization,
title = {A generalization of the allreduce operation},
author = {Kolmakov, Dmitry and Zhang, Xuecang},
journal = {arXiv preprint arXiv:2004.09362},
url = {https://arxiv.org/abs/2004.09362},
year = {2020}
}
Synthesizing optimal collective algorithms
Zixian Cai, Zhengyang Liu, Saeed Maleki, Madanlal Musuvathi, Todd Mytkowicz, Jacob Nelson, and Olli Saarikivi.
Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming,
Virtual Event, Republic of Korea,
2021.
Abstract
BibTeX
Click
here to close the dropdown!
Collective communication algorithms are an important component of distributed computation. Indeed, in the case of deep-learning, collective communication is the Amdahl’s bottleneck of data-parallel training.This paper introduces SCCL (for Synthesized Collective Communication Library), a systematic approach to synthesizing collective communication algorithms that are explicitly tailored to a particular hardware topology. SCCL synthesizes algorithms along the Pareto-frontier spanning from latency-optimal to bandwidth-optimal implementations of a collective. The paper demonstrates how to encode the synthesis problem as a quantifier-free SMT formula which can be discharged to a theorem prover. We show how our carefully built encoding enables SCCL to scale.We synthesize novel latency and bandwidth optimal algorithms not seen in the literature on two popular hardware topologies. We also show how SCCL efficiently lowers algorithms to implementations on two hardware architectures (NVIDIA and AMD) and demonstrate competitive performance with hand optimized collective communication libraries.
Click
here to close the dropdown!
@inproceedings{10.1145/3437801.3441620,
author = {Cai, Zixian and Liu, Zhengyang and Maleki, Saeed and Musuvathi, Madanlal and Mytkowicz, Todd and Nelson, Jacob and Saarikivi, Olli},
title = {Synthesizing optimal collective algorithms},
year = {2021},
isbn = {9781450382946},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3437801.3441620},
doi = {10.1145/3437801.3441620},
booktitle = {Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming},
pages = {62–75},
numpages = {14},
keywords = {synthesis, network, interconnection, collective communication, GPU},
series = {PPoPP '21}
}
Blink: Fast and Generic Collectives for Distributed ML
Guanhua Wang, Shivaram Venkataraman, Amar Phanishayee, Nikhil Devanur, Jorgen Thelin, and Ion Stoica.
Proceedings of Machine Learning and Systems,
2020.
BibTeX
Click
here to close the dropdown!
@inproceedings{MLSYS2020_cd3a9a55,
author = {Wang, Guanhua and Venkataraman, Shivaram and Phanishayee, Amar and Devanur, Nikhil and Thelin, Jorgen and Stoica, Ion},
booktitle = {Proceedings of Machine Learning and Systems},
editor = {Dhillon, I. and Papailiopoulos, D. and Sze, V.},
pages = {172--186},
title = {Blink: Fast and Generic Collectives for Distributed ML},
url = {https://proceedings.mlsys.org/paper_files/paper/2020/file/cd3a9a55f7f3723133fa4a13628cdf03-Paper.pdf},
volume = {2},
year = {2020}
}
Rethinking Machine Learning Collective Communication as a Multi-Commodity Flow Problem
Xuting Liu, Behnaz Arzani, Siva Kesava Reddy Kakarla, Liangyu Zhao, Vincent Liu, Miguel Castro, Srikanth Kandula, and Luke Marshall.
Proceedings of the ACM SIGCOMM 2024 Conference,
Sydney, NSW, Australia,
2024.
Abstract
BibTeX
Click
here to close the dropdown!
Cloud operators utilize collective communication optimizers to enhance the efficiency of the single-tenant, centrally managed training clusters they manage. However, current optimizers struggle to scale for such use cases and often compromise solution quality for scalability. Our solution, TE-CCL, adopts a traffic-engineering-based approach to collective communication. Compared to a state-of-the-art optimizer, TACCL, TE-CCL produced schedules with 2\texttimes better performance on topologies TACCL supports (and its solver took a similar amount of time as TACCL’s heuristic-based approach). TECCL additionally scales to larger topologies than TACCL. On our GPU testbed, TE-CCL outperformed TACCL by 2.14\texttimes and RCCL by 3.18\texttimes in terms of algorithm bandwidth.
Click
here to close the dropdown!
@inproceedings{10.1145/3651890.3672249,
author = {Liu, Xuting and Arzani, Behnaz and Kakarla, Siva Kesava Reddy and Zhao, Liangyu and Liu, Vincent and Castro, Miguel and Kandula, Srikanth and Marshall, Luke},
title = {Rethinking Machine Learning Collective Communication as a Multi-Commodity Flow Problem},
year = {2024},
isbn = {9798400706141},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3651890.3672249},
doi = {10.1145/3651890.3672249},
booktitle = {Proceedings of the ACM SIGCOMM 2024 Conference},
pages = {16–37},
numpages = {22},
keywords = {GPU, collective communication, traffic engineering},
series = {ACM SIGCOMM '24}
}
TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches
Aashaka Shah, Vijay Chidambaram, Meghan Cowan, Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Jacob Nelson, Olli Saarikivi, and Rachee Singh.
20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23),
2023.
BibTeX
Click
here to close the dropdown!
@inproceedings{a285084,
author = {Shah, Aashaka and Chidambaram, Vijay and Cowan, Meghan and Maleki, Saeed and Musuvathi, Madan and Mytkowicz, Todd and Nelson, Jacob and Saarikivi, Olli and Singh, Rachee},
title = {{TACCL}: Guiding Collective Algorithm Synthesis using Communication Sketches},
booktitle = {20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23)},
year = {2023},
isbn = {978-1-939133-33-5},
address = {Boston, MA},
pages = {593--612},
url = {https://www.usenix.org/conference/nsdi23/presentation/shah},
publisher = {USENIX Association}
}
AutoCCL: Automated Collective Communication Tuning for Accelerating Distributed and Parallel DNN Training
Guanbin Xu, Zhihao Le, Yinhe Chen, Zhiqi Lin, Zewen Jin, Youshan Miao, and Cheng Li.
22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25),
2025.
BibTeX
Click
here to close the dropdown!
@inproceedings{a305967,
author = {Xu, Guanbin and Le, Zhihao and Chen, Yinhe and Lin, Zhiqi and Jin, Zewen and Miao, Youshan and Li, Cheng},
title = {{AutoCCL}: Automated Collective Communication Tuning for Accelerating Distributed and Parallel {DNN} Training},
booktitle = {22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25)},
year = {2025},
isbn = {978-1-939133-46-5},
address = {Philadelphia, PA},
pages = {667--683},
url = {https://www.usenix.org/conference/nsdi25/presentation/xu-guanbin},
publisher = {USENIX Association}
}
Accelerating AllReduce with a Persistent Straggler
Arjun Devraj, Eric Ding, Abhishek Vijaya Kumar, Robert Kleinberg, and Rachee Singh.
arXiv preprint arXiv:2505.23523,
2025.
BibTeX
Click
here to close the dropdown!
@article{devraj2025accelerating,
title = {Accelerating AllReduce with a Persistent Straggler},
author = {Devraj, Arjun and Ding, Eric and Kumar, Abhishek Vijaya and Kleinberg, Robert and Singh, Rachee},
journal = {arXiv preprint arXiv:2505.23523},
year = {2025},
url = {https://arxiv.org/abs/2505.23523}
}
Addressing the straggler problem for iterative convergent parallel ML
Aaron Harlap, Henggang Cui, Wei Dai, Jinliang Wei, Gregory R. Ganger, Phillip B. Gibbons, Garth A. Gibson, and Eric P. Xing.
Proceedings of the Seventh ACM Symposium on Cloud Computing,
Santa Clara, CA, USA,
2016.
Abstract
BibTeX
Click
here to close the dropdown!
FlexRR provides a scalable, efficient solution to the straggler problem for iterative machine learning (ML). The frequent (e.g., per iteration) barriers used in traditional BSP-based distributed ML implementations cause every transient slowdown of any worker thread to delay all others. FlexRR combines a more flexible synchronization model with dynamic peer-to-peer re-assignment of work among workers to address straggler threads. Experiments with real straggler behavior observed on Amazon EC2 and Microsoft Azure, as well as injected straggler behavior stress tests, confirm the significance of the problem and the effectiveness of FlexRR’s solution. Using FlexRR, we consistently observe near-ideal run-times (relative to no performance jitter) across all real and injected straggler behaviors tested.
Click
here to close the dropdown!
@inproceedings{10.1145/2987550.2987554,
author = {Harlap, Aaron and Cui, Henggang and Dai, Wei and Wei, Jinliang and Ganger, Gregory R. and Gibbons, Phillip B. and Gibson, Garth A. and Xing, Eric P.},
title = {Addressing the straggler problem for iterative convergent parallel ML},
year = {2016},
isbn = {9781450345255},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/2987550.2987554},
doi = {10.1145/2987550.2987554},
booktitle = {Proceedings of the Seventh ACM Symposium on Cloud Computing},
pages = {98–111},
numpages = {14},
series = {SoCC '16}
}
OptiReduce: Resilient and Tail-Optimal AllReduce for Distributed Deep Learning in the Cloud
Ertza Warraich, Omer Shabtai, Khalid Manaa, Shay Vargaftik, Yonatan Piasetzky, Matty Kadosh, Lalith Suresh, and Muhammad Shahbaz.
22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25),
2025.
BibTeX
Click
here to close the dropdown!
@inproceedings{a305995,
author = {Warraich, Ertza and Shabtai, Omer and Manaa, Khalid and Vargaftik, Shay and Piasetzky, Yonatan and Kadosh, Matty and Suresh, Lalith and Shahbaz, Muhammad},
title = {{OptiReduce}: Resilient and {Tail-Optimal} {AllReduce} for Distributed Deep Learning in the Cloud},
booktitle = {22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25)},
year = {2025},
isbn = {978-1-939133-46-5},
address = {Philadelphia, PA},
pages = {685--703},
url = {https://www.usenix.org/conference/nsdi25/presentation/warraich},
publisher = {USENIX Association}
}
Straggler Mitigation in Distributed Optimization Through Data Encoding
Can Karakus, Yifan Sun, Suhas Diggavi, and Wotao Yin.
Advances in Neural Information Processing Systems,
2017.
BibTeX
Click
here to close the dropdown!
@inproceedings{NIPS2017_663772ea,
author = {Karakus, Can and Sun, Yifan and Diggavi, Suhas and Yin, Wotao},
booktitle = {Advances in Neural Information Processing Systems},
editor = {Guyon, I. and Luxburg, U. Von and Bengio, S. and Wallach, H. and Fergus, R. and Vishwanathan, S. and Garnett, R.},
pages = {},
publisher = {Curran Associates, Inc.},
title = {Straggler Mitigation in Distributed Optimization Through Data Encoding},
url = {https://proceedings.neurips.cc/paper_files/paper/2017/file/663772ea088360f95bac3dc7ffb841be-Paper.pdf},
volume = {30},
year = {2017}
}
Solving the straggler problem with bounded staleness
James Cipar, Qirong Ho, Jin Kyu Kim, Seunghak Lee, Gregory R Ganger, Garth Gibson, Kimberly Keeton, and Eric Xing.
14th Workshop on Hot Topics in Operating Systems (HotOS XIV),
2013.
BibTeX
Click
here to close the dropdown!
@inproceedings{cipar2013solving,
title = {Solving the straggler problem with bounded staleness},
author = {Cipar, James and Ho, Qirong and Kim, Jin Kyu and Lee, Seunghak and Ganger, Gregory R and Gibson, Garth and Keeton, Kimberly and Xing, Eric},
booktitle = {14th Workshop on Hot Topics in Operating Systems (HotOS XIV)},
year = {2013}
}
RotorNet: A Scalable, Low-complexity, Optical Datacenter Network
William M. Mellette, Rob McGuinness, Arjun Roy, Alex Forencich, George Papen, Alex C. Snoeren, and George Porter.
Proceedings of the Conference of the ACM Special Interest Group on Data Communication,
Los Angeles, CA, USA,
2017.
Abstract
BibTeX
Click
here to close the dropdown!
The ever-increasing bandwidth requirements of modern datacenters have led researchers to propose networks based upon optical circuit switches, but these proposals face significant deployment challenges. In particular, previous proposals dynamically configure circuit switches in response to changes in workload, requiring network-wide demand estimation, centralized circuit assignment, and tight time synchronization between various network elements— resulting in a complex and unwieldy control plane. Moreover, limitations in the technologies underlying the individual circuit switches restrict both the rate at which they can be reconfigured and the scale of the network that can be constructed.We propose RotorNet, a circuit-based network design that addresses these two challenges. While RotorNet dynamically reconfigures its constituent circuit switches, it decouples switch configuration from traffic patterns, obviating the need for demand collection and admitting a fully decentralized control plane. At the physical layer, RotorNet relaxes the requirements on the underlying circuit switches—in particular by not requiring individual switches to implement a full crossbar—enabling them to scale to 1000s of ports. We show that RotorNet outperforms comparably priced Fat Tree topologies under a variety of workload conditions, including traces taken from two commercial datacenters. We also demonstrate a small-scale RotorNet operating in practice on an eight-node testbed.
Click
here to close the dropdown!
@inproceedings{10.1145/3098822.3098838,
author = {Mellette, William M. and McGuinness, Rob and Roy, Arjun and Forencich, Alex and Papen, George and Snoeren, Alex C. and Porter, George},
title = {RotorNet: A Scalable, Low-complexity, Optical Datacenter Network},
year = {2017},
isbn = {9781450346535},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3098822.3098838},
doi = {10.1145/3098822.3098838},
booktitle = {Proceedings of the Conference of the ACM Special Interest Group on Data Communication},
pages = {267–280},
numpages = {14},
keywords = {optical switching, Datacenter},
series = {SIGCOMM '17}
}
Realizing RotorNet: Toward Practical Microsecond Scale Optical Networking
William M. Mellette, Alex Forencich, Rukshani Athapathu, Alex C. Snoeren, George Papen, and George Porter.
Proceedings of the ACM SIGCOMM 2024 Conference,
Sydney, NSW, Australia,
2024.
Abstract
BibTeX
Click
here to close the dropdown!
We describe our experience building and deploying a demand-oblivious optically-switched network based on the RotorNet and Opera architectures. We detail the design, manufacture, deployment, and end-to-end operation of a 128-port optical rotor switch along with supporting NIC hardware and host software. Using this prototype, we assess yield, synchronization, and interoperability with commodity hardware and software at a scale of practical relevance. We provide the first real-world measurements of Linux TCP throughput and host-to-host latency in an operational RotorNet, achieving 98% of link rate with 99th-percentile ping times faster than commodity packet-switching hardware. In the process, we uncover unexpected challenges with link-level dropouts and devise a novel and flexible way to address them. Our deployment experience demonstrates the feasibility of our implementation approach and identifies opportunities for future exploration.
Click
here to close the dropdown!
@inproceedings{10.1145/3651890.3672273,
author = {Mellette, William M. and Forencich, Alex and Athapathu, Rukshani and Snoeren, Alex C. and Papen, George and Porter, George},
title = {Realizing RotorNet: Toward Practical Microsecond Scale Optical Networking},
year = {2024},
isbn = {9798400706141},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3651890.3672273},
doi = {10.1145/3651890.3672273},
booktitle = {Proceedings of the ACM SIGCOMM 2024 Conference},
pages = {392–414},
numpages = {23},
keywords = {circuit-switching, optical networking, datacenter networks},
series = {ACM SIGCOMM '24}
}
Sirius: A Flat Datacenter Network with Nanosecond Optical Switching
Hitesh Ballani, Paolo Costa, Raphael Behrendt, Daniel Cletheroe, Istvan Haller, Krzysztof Jozwik, Fotini Karinou, Sophie Lange, Kai Shi, Benn Thomsen, and Hugh Williams.
Proceedings of the Annual Conference of the ACM Special Interest Group on Data Communication on the Applications, Technologies, Architectures, and Protocols for Computer Communication,
Virtual Event, USA,
2020.
Abstract
BibTeX
Click
here to close the dropdown!
The increasing gap between the growth of datacenter traffic and electrical switch capacity is expected to worsen due to the slowdown of Moore’s law, motivating the need for a new switching technology for the post-Moore’s law era that can meet the increasingly stringent requirements of hardware-driven cloud workloads. We propose Sirius, an optically-switched network for datacenters providing the abstraction of a single, high-radix switch that can connect thousands of nodes—racks or servers—in a datacenter while achieving nanosecond-granularity reconfiguration. At its core, Sirius uses a combination of tunable lasers and simple, passive gratings that route light based on its wavelength. Sirius’ switching technology and topology is tightly codesigned with its routing and scheduling and with novel congestion-control and time-synchronization mechanisms to achieve a scalable yet flat network that can offer high bandwidth and very low end-to-end latency. Through a small-scale prototype using a custom tunable laser chip that can tune in less than 912 ps, we demonstrate 3.84 ns end-to-end reconfiguration atop 50 Gbps channels. Through large-scale simulations, we show that Sirius can approximate the performance of an ideal, electrically-switched non-blocking network with up to 74-77% lower power.
Click
here to close the dropdown!
@inproceedings{10.1145/3387514.3406221,
author = {Ballani, Hitesh and Costa, Paolo and Behrendt, Raphael and Cletheroe, Daniel and Haller, Istvan and Jozwik, Krzysztof and Karinou, Fotini and Lange, Sophie and Shi, Kai and Thomsen, Benn and Williams, Hugh},
title = {Sirius: A Flat Datacenter Network with Nanosecond Optical Switching},
year = {2020},
isbn = {9781450379557},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3387514.3406221},
doi = {10.1145/3387514.3406221},
booktitle = {Proceedings of the Annual Conference of the ACM Special Interest Group on Data Communication on the Applications, Technologies, Architectures, and Protocols for Computer Communication},
pages = {782–797},
numpages = {16},
keywords = {Datacenter Networks, Fast Tunable Lasers, Nanosecond Switching, Optical Switches, Scheduler-less design, Vertical Integration},
series = {SIGCOMM '20}
}
Mars: Near-Optimal Throughput with Shallow Buffers in Reconfigurable Datacenter Networks
Vamsi Addanki, Chen Avin, and Stefan Schmid.
Proc. ACM Meas. Anal. Comput. Syst.,
2023.
Abstract
BibTeX
Click
here to close the dropdown!
The performance of large-scale computing systems often critically depends on high-performance communication networks. Dynamically reconfigurable topologies, e.g., based on optical circuit switches, are emerging as an innovative new technology to deal with the explosive growth of datacenter traffic. Specifically, periodic reconfigurable datacenter networks (RDCNs) such as RotorNet (SIGCOMM 2017), Opera (NSDI 2020) and Sirius (SIGCOMM 2020) have been shown to provide high throughput, by emulating a complete graph through fast periodic circuit switch scheduling.However, to achieve such a high throughput, existing reconfigurable network designs pay a high price: in terms of potentially high delays, but also, as we show as a first contribution in this paper, in terms of the high buffer requirements. In particular, we show that under buffer constraints, emulating the high-throughput complete graph is infeasible at scale, and we uncover a spectrum of unvisited and attractive alternative RDCNs, which emulate regular graphs, but with lower node degree than the complete graph.We present Mars, a periodic reconfigurable topology which emulates ad-regular graph with near-optimal throughput. In particular, we systematically analyze how the degree d can be optimized for throughput given the available buffer and delay tolerance of the datacenter. We further show empirically that Mars achieves higher throughput compared to existing systems when buffer sizes are bounded.
Click
here to close the dropdown!
@article{10.1145/3579312,
author = {Addanki, Vamsi and Avin, Chen and Schmid, Stefan},
title = {Mars: Near-Optimal Throughput with Shallow Buffers in Reconfigurable Datacenter Networks},
year = {2023},
issue_date = {March 2023},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {7},
number = {1},
url = {https://doi.org/10.1145/3579312},
doi = {10.1145/3579312},
journal = {Proc. ACM Meas. Anal. Comput. Syst.},
articleno = {2},
numpages = {43},
keywords = {buffer requirements, datacenter, reconfigurable networks, throughput}
}
Shale: A Practical, Scalable Oblivious Reconfigurable Network
Daniel Amir, Nitika Saran, Tegan Wilson, Robert Kleinberg, Vishal Shrivastav, and Hakim Weatherspoon.
Proceedings of the ACM SIGCOMM 2024 Conference,
Sydney, NSW, Australia,
2024.
Abstract
BibTeX
Click
here to close the dropdown!
Circuit-switched technologies have long been proposed for handling high-throughput traffic in datacenter networks, but recent developments in nanosecond-scale reconfiguration have created the enticing possibility of handling low-latency traffic as well. The novel Oblivious Reconfigurable Network (ORN) design paradigm promises to deliver on this possibility. Prior work in ORN designs achieved latencies that scale linearly with system size, making them unsuitable for large-scale deployments. Recent theoretical work showed that ORNs can achieve far better latency scaling, proposing theoretical ORN designs that are Pareto optimal in latency and throughput.In this work, we bridge multiple gaps between theory and practice to develop Shale, the first ORN capable of providing low-latency networking at datacenter scale while still guaranteeing high throughput. By interleaving multiple Pareto optimal schedules in parallel, both latency- and throughput-sensitive flows can achieve optimal performance. To achieve the theoretical low latencies in practice, we design a new congestion control mechanism which is best suited to the characteristics of Shale. In datacenter-scale packet simulations, our design compares favorably with both an in-network congestion mitigation strategy, modern receiver-driven protocols such as NDP, and an idealized analog for sender-driven protocols. We implement an FPGA-based prototype of Shale, achieving orders of magnitude better resource scaling than existing ORN proposals. Finally, we extend our congestion control solution to handle node and link failures.
Click
here to close the dropdown!
@inproceedings{10.1145/3651890.3672248,
author = {Amir, Daniel and Saran, Nitika and Wilson, Tegan and Kleinberg, Robert and Shrivastav, Vishal and Weatherspoon, Hakim},
title = {Shale: A Practical, Scalable Oblivious Reconfigurable Network},
year = {2024},
isbn = {9798400706141},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3651890.3672248},
doi = {10.1145/3651890.3672248},
booktitle = {Proceedings of the ACM SIGCOMM 2024 Conference},
pages = {449–464},
numpages = {16},
keywords = {optical switches, datacenter networks, nanosecond switching},
series = {ACM SIGCOMM '24}
}
Scheduling techniques for hybrid circuit/packet networks
He Liu, Matthew K. Mukerjee, Conglong Li, Nicolas Feltman, George Papen, Stefan Savage, Srinivasan Seshan, Geoffrey M. Voelker, David G. Andersen, Michael Kaminsky, George Porter, and Alex C. Snoeren.
Proceedings of the 11th ACM Conference on Emerging Networking Experiments and Technologies,
Heidelberg, Germany,
2015.
Abstract
BibTeX
Click
here to close the dropdown!
A range of new datacenter switch designs combine wireless or optical circuit technologies with electrical packet switching to deliver higher performance at lower cost than traditional packet-switched networks. These "hybrid" networks schedule large traffic demands via a high-rate circuits and remaining traffic with a lower-rate, traditional packet-switches. Achieving high utilization requires an efficient scheduling algorithm that can compute proper circuit configurations and balance traffic across the switches. Recent proposals, however, provide no such algorithm and rely on an omniscient oracle to compute optimal switch configurations.Finding the right balance of circuit and packet switch use is difficult: circuits must be reconfigured to serve different demands, incurring non-trivial switching delay, while the packet switch is bandwidth constrained. Adapting existing crossbar scheduling algorithms proves challenging with these constraints. In this paper, we formalize the hybrid switching problem, explore the design space of scheduling algorithms, and provide insight on using such algorithms in practice. We propose a heuristic-based algorithm, Solstice that provides a 2.9\texttimes increase in circuit utilization over traditional scheduling algorithms, while being within 14% of optimal, at scale.
Click
here to close the dropdown!
@inproceedings{10.1145/2716281.2836126,
author = {Liu, He and Mukerjee, Matthew K. and Li, Conglong and Feltman, Nicolas and Papen, George and Savage, Stefan and Seshan, Srinivasan and Voelker, Geoffrey M. and Andersen, David G. and Kaminsky, Michael and Porter, George and Snoeren, Alex C.},
title = {Scheduling techniques for hybrid circuit/packet networks},
year = {2015},
isbn = {9781450334129},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/2716281.2836126},
doi = {10.1145/2716281.2836126},
booktitle = {Proceedings of the 11th ACM Conference on Emerging Networking Experiments and Technologies},
articleno = {41},
numpages = {13},
keywords = {circuit networks, hybrid networks, packet networks},
series = {CoNEXT '15}
}
Integrating microsecond circuit switching into the data center
George Porter, Richard Strong, Nathan Farrington, Alex Forencich, Pang Chen-Sun, Tajana Rosing, Yeshaiahu Fainman, George Papen, and Amin Vahdat.
Proceedings of the ACM SIGCOMM 2013 Conference on SIGCOMM,
Hong Kong, China,
2013.
Abstract
BibTeX
Click
here to close the dropdown!
Recent proposals have employed optical circuit switching (OCS) to reduce the cost of data center networks. However, the relatively slow switching times (10–100 ms) assumed by these approaches, and the accompanying latencies of their control planes, has limited its use to only the largest data center networks with highly aggregated and constrained workloads. As faster switch technologies become available, designing a control plane capable of supporting them becomes a key challenge.In this paper, we design and implement an OCS prototype capable of switching in 11.5 us, and we use this prototype to expose a set of challenges that arise when supporting switching at microsecond time scales. In response, we propose a microsecond-latency control plane based on a circuit scheduling approach we call Traffic Matrix Scheduling (TMS) that proactively communicates circuit assignments to communicating entities so that circuit bandwidth can be used efficiently.
Click
here to close the dropdown!
@inproceedings{10.1145/2486001.2486007,
author = {Porter, George and Strong, Richard and Farrington, Nathan and Forencich, Alex and Chen-Sun, Pang and Rosing, Tajana and Fainman, Yeshaiahu and Papen, George and Vahdat, Amin},
title = {Integrating microsecond circuit switching into the data center},
year = {2013},
isbn = {9781450320566},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/2486001.2486007},
doi = {10.1145/2486001.2486007},
booktitle = {Proceedings of the ACM SIGCOMM 2013 Conference on SIGCOMM},
pages = {447–458},
numpages = {12},
keywords = {optical networks, data center networks},
series = {SIGCOMM '13}
}
Helios: a hybrid electrical/optical switch architecture for modular data centers
Nathan Farrington, George Porter, Sivasankar Radhakrishnan, Hamid Hajabdolali Bazzaz, Vikram Subramanya, Yeshaiahu Fainman, George Papen, and Amin Vahdat.
Proceedings of the ACM SIGCOMM 2010 Conference,
New Delhi, India,
2010.
Abstract
BibTeX
Click
here to close the dropdown!
The basic building block of ever larger data centers has shifted from a rack to a modular container with hundreds or even thousands of servers. Delivering scalable bandwidth among such containers is a challenge. A number of recent efforts promise full bisection bandwidth between all servers, though with significant cost, complexity, and power consumption. We present Helios, a hybrid electrical/optical switch architecture that can deliver significant reductions in the number of switching elements, cabling, cost, and power consumption relative to recently proposed data center network architectures. We explore architectural trade offs and challenges associated with realizing these benefits through the evaluation of a fully functional Helios prototype.
Click
here to close the dropdown!
@inproceedings{10.1145/1851182.1851223,
author = {Farrington, Nathan and Porter, George and Radhakrishnan, Sivasankar and Bazzaz, Hamid Hajabdolali and Subramanya, Vikram and Fainman, Yeshaiahu and Papen, George and Vahdat, Amin},
title = {Helios: a hybrid electrical/optical switch architecture for modular data centers},
year = {2010},
isbn = {9781450302012},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/1851182.1851223},
doi = {10.1145/1851182.1851223},
booktitle = {Proceedings of the ACM SIGCOMM 2010 Conference},
pages = {339–350},
numpages = {12},
keywords = {optical networks, data center networks},
series = {SIGCOMM '10}
}
ProjecToR: Agile Reconfigurable Data Center Interconnect
Monia Ghobadi, Ratul Mahajan, Amar Phanishayee, Nikhil Devanur, Janardhan Kulkarni, Gireeja Ranade, Pierre-Alexandre Blanche, Houman Rastegarfar, Madeleine Glick, and Daniel Kilper.
Proceedings of the 2016 ACM SIGCOMM Conference,
Florianopolis, Brazil,
2016.
Abstract
BibTeX
Click
here to close the dropdown!
We explore a novel, free-space optics based approach for building data center interconnects. It uses a digital micromirror device (DMD) and mirror assembly combination as a transmitter and a photodetector on top of the rack as a receiver (Figure 1). Our approach enables all pairs of racks to establish direct links, and we can reconfigure such links (i.e., connect different rack pairs) within 12 us. To carry traffic from a source to a destination rack, transmitters and receivers in our interconnect can be dynamically linked in millions of ways. We develop topology construction and routing methods to exploit this flexibility, including a flow scheduling algorithm that is a constant factor approximation to the offline optimal solution. Experiments with a small prototype point to the feasibility of our approach. Simulations using realistic data center workloads show that, compared to the conventional folded-Clos interconnect, our approach can improve mean flow completion time by 30-95% and reduce cost by 25-40%.
Click
here to close the dropdown!
@inproceedings{10.1145/2934872.2934911,
author = {Ghobadi, Monia and Mahajan, Ratul and Phanishayee, Amar and Devanur, Nikhil and Kulkarni, Janardhan and Ranade, Gireeja and Blanche, Pierre-Alexandre and Rastegarfar, Houman and Glick, Madeleine and Kilper, Daniel},
title = {ProjecToR: Agile Reconfigurable Data Center Interconnect},
year = {2016},
isbn = {9781450341936},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/2934872.2934911},
doi = {10.1145/2934872.2934911},
booktitle = {Proceedings of the 2016 ACM SIGCOMM Conference},
pages = {216–229},
numpages = {14},
keywords = {Reconfigurability, Free-Space Optics, Data Centers},
series = {SIGCOMM '16}
}
NegotiaToR: Towards A Simple Yet Effective On-demand Reconfigurable Datacenter Network
Cong Liang, Xiangli Song, Jing Cheng, Mowei Wang, Yashe Liu, Zhenhua Liu, Shizhen Zhao, and Yong Cui.
Proceedings of the ACM SIGCOMM 2024 Conference,
Sydney, NSW, Australia,
2024.
Abstract
BibTeX
Click
here to close the dropdown!
Recent advances in fast optical switching technology show promise in meeting the high goodput and low latency requirements of datacenter networks (DCN). We present NegotiaToR, a simple network architecture for optical reconfigurable DCNs that utilizes on-demand scheduling to handle dynamic traffic. In NegotiaToR, racks exchange scheduling messages through an in-band control plane and distributedly calculate non-conflicting paths from binary traffic demand information. Optimized for incasts, it also provides opportunities to bypass scheduling delays. NegotiaToR is compatible with prevalent flat topologies, and is tailored towards a minimalist design for on-demand reconfigurable DCNs, enhancing practicality. Through large-scale simulations, we show that NegotiaToR achieves both small mice flow completion time (FCT) and high goodput on two representative flat topologies, especially under heavy loads. Particularly, the FCT of mice flows is one to two orders of magnitude better than the state-of-the-art traffic-oblivious reconfigurable DCN design.
BibTeX
@inproceedings{10.1145/3651890.3672222,
author = {Liang, Cong and Song, Xiangli and Cheng, Jing and Wang, Mowei and Liu, Yashe and Liu, Zhenhua and Zhao, Shizhen and Cui, Yong},
title = {NegotiaToR: Towards A Simple Yet Effective On-demand Reconfigurable Datacenter Network},
year = {2024},
isbn = {9798400706141},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3651890.3672222},
doi = {10.1145/3651890.3672222},
booktitle = {Proceedings of the ACM SIGCOMM 2024 Conference},
pages = {415–432},
numpages = {18},
keywords = {datacenter network, optical switching},
series = {ACM SIGCOMM '24}
}
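A toy rendition of the abstract's distributed-agreement idea, not NegotiaToR's protocol: once every rack has seen the same binary demand bits over the in-band control plane, each rack can run the identical deterministic matching locally, so the resulting circuit choices cannot conflict. Rack names are hypothetical.

def conflict_free_matching(binary_demand):
    # binary_demand: set of (src, dst) pairs with pending traffic; identical
    # at every rack after the in-band exchange, so every rack computes the
    # same matching without a central scheduler.
    used_src, used_dst, matching = set(), set(), []
    for src, dst in sorted(binary_demand):  # same deterministic order everywhere
        if src not in used_src and dst not in used_dst:
            matching.append((src, dst))
            used_src.add(src)
            used_dst.add(dst)
    return matching

print(conflict_free_matching({("r1", "r2"), ("r1", "r3"), ("r2", "r3"), ("r3", "r1")}))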
TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings
Norm Jouppi, George Kurian, Sheng Li, Peter Ma, Rahul Nagarajan, Lifeng Nai, Nishant Patil, Suvinay Subramanian, Andy Swing, Brian Towles, Clifford Young, Xiang Zhou, Zongwei Zhou, and David A. Patterson.
Proceedings of the 50th Annual International Symposium on Computer Architecture,
Orlando, FL, USA,
2023.
Abstract
In response to innovations in machine learning (ML) models, production workloads changed radically and rapidly. TPU v4 is the fifth Google domain specific architecture (DSA) and its third supercomputer for such ML models. Optical circuit switches (OCSes) dynamically reconfigure its interconnect topology to improve scale, availability, utilization, modularity, deployment, security, power, and performance; users can pick a twisted 3D torus topology if desired. Much cheaper, lower power, and faster than Infiniband, OCSes and underlying optical components are <5% of system cost and <3% of system power. Each TPU v4 includes SparseCores, dataflow processors that accelerate models that rely on embeddings by 5x–7x yet use only 5% of die area and power. Deployed since 2020, TPU v4 outperforms TPU v3 by 2.1x and improves performance/Watt by 2.7x. The TPU v4 supercomputer is 4x larger at 4096 chips and thus nearly 10x faster overall, which along with OCS flexibility and availability allows a large language model to train at an average of 60% of peak FLOPS/second. For similar sized systems, it is 4.3x–4.5x faster than the Graphcore IPU Bow and is 1.2x–1.7x faster and uses 1.3x–1.9x less power than the Nvidia A100. TPU v4s inside the energy-optimized warehouse scale computers of Google Cloud use 2–6x less energy and produce 20x less CO2e than contemporary DSAs in typical on-premise data centers.
BibTeX
@inproceedings{10.1145/3579371.3589350,
author = {Jouppi, Norm and Kurian, George and Li, Sheng and Ma, Peter and Nagarajan, Rahul and Nai, Lifeng and Patil, Nishant and Subramanian, Suvinay and Swing, Andy and Towles, Brian and Young, Clifford and Zhou, Xiang and Zhou, Zongwei and Patterson, David A},
title = {TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings},
year = {2023},
isbn = {9798400700958},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3579371.3589350},
doi = {10.1145/3579371.3589350},
booktitle = {Proceedings of the 50th Annual International Symposium on Computer Architecture},
articleno = {82},
numpages = {14},
keywords = {machine learning, domain specific architecture, TPU, GPU, IPU, supercomputer, optical interconnect, reconfigurable, embeddings, large language model, power usage effectiveness, warehouse scale computer, carbon emissions, energy, CO2 equivalent emissions},
series = {ISCA '23}
}
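A minimal sketch of the wrap-around structure that makes a 3D torus attractive for collectives. A 4096-chip TPU v4 pod is logically a 16x16x16 torus assembled from 4x4x4 blocks; the OCS layer rewires the inter-block links, and the twisted variant mentioned in the abstract permutes the wrap links, which this plain-torus sketch does not model.

def torus_neighbors(coord, shape):
    # The six neighbors of a chip at (x, y, z) in a plain 3D torus,
    # with wrap-around on every axis.
    x, y, z = coord
    X, Y, Z = shape
    return [((x + 1) % X, y, z), ((x - 1) % X, y, z),
            (x, (y + 1) % Y, z), (x, (y - 1) % Y, z),
            (x, y, (z + 1) % Z), (x, y, (z - 1) % Z)]

print(torus_neighbors((0, 0, 0), (16, 16, 16)))  # a corner chip wraps around to 15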
A Machine Learning Supercomputer with an Optically Reconfigurable Interconnect and Embeddings Support
Norman P. Jouppi and Andy Swing.
2023 IEEE Hot Chips 35 Symposium (HCS),
2023.
BibTeX
@inproceedings{a10254691,
author = {Jouppi, Norman P. and Swing, Andy},
booktitle = {2023 IEEE Hot Chips 35 Symposium (HCS)},
title = {A Machine Learning Supercomputer with an Optically Reconfigurable Interconnect and Embeddings Support},
year = {2023},
pages = {1-24},
keywords = {Optical interconnections;Optical switches;Integrated circuit interconnections;Machine learning;Supercomputers;Switching circuits},
doi = {10.1109/HCS59251.2023.10254691}
}
Resiliency at Scale: Managing Google’s TPUv4 Machine Learning Supercomputer
Yazhou Zu, Alireza Ghaffarkhah, Hoang-Vu Dang, Brian Towles, Steven Hand, Safeen Huda, Adekunle Bello, Alexander Kolbasov, Arash Rezaei, Dayou Du, Steve Lacy, Hang Wang, Aaron Wisner, Chris Lewis, and Henri Bahini.
21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24),
2024.
BibTeX
@inproceedings{a295551,
author = {Zu, Yazhou and Ghaffarkhah, Alireza and Dang, Hoang-Vu and Towles, Brian and Hand, Steven and Huda, Safeen and Bello, Adekunle and Kolbasov, Alexander and Rezaei, Arash and Du, Dayou and Lacy, Steve and Wang, Hang and Wisner, Aaron and Lewis, Chris and Bahini, Henri},
title = {Resiliency at Scale: Managing {Google{\textquoteright}s} {TPUv4} Machine Learning Supercomputer},
booktitle = {21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24)},
year = {2024},
isbn = {978-1-939133-39-7},
address = {Santa Clara, CA},
pages = {761--774},
url = {https://www.usenix.org/conference/nsdi24/presentation/zu},
publisher = {USENIX Association}
}
Jupiter evolving: transforming Google’s datacenter network via optical circuit switches and software-defined networking
Leon Poutievski, Omid Mashayekhi, Joon Ong, Arjun Singh, Mukarram Tariq, Rui Wang, Jianan Zhang, Virginia Beauregard, Patrick Conner, Steve Gribble, Rishi Kapoor, Stephen Kratzer, Nanfang Li, Hong Liu, Karthik Nagaraj, Jason Ornstein, Samir Sawhney, Ryohei Urata, Lorenzo Vicisano, Kevin Yasumura, Shidong Zhang, Junlan Zhou, and Amin Vahdat.
Proceedings of the ACM SIGCOMM 2022 Conference,
Amsterdam, Netherlands,
2022.
Abstract
We present a decade of evolution and production experience with Jupiter datacenter network fabrics. In this period Jupiter has delivered 5x higher speed and capacity, 30% reduction in capex, 41% reduction in power, incremental deployment and technology refresh all while serving live production traffic. A key enabler for these improvements is evolving Jupiter from a Clos to a direct-connect topology among the machine aggregation blocks. Critical architectural changes for this include: A datacenter interconnection layer employing Micro-Electro-Mechanical Systems (MEMS) based Optical Circuit Switches (OCSes) to enable dynamic topology reconfiguration, centralized Software-Defined Networking (SDN) control for traffic engineering, and automated network operations for incremental capacity delivery and topology engineering. We show that the combination of traffic and topology engineering on direct-connect fabrics achieves similar throughput as Clos fabrics for our production traffic patterns. We also optimize for path lengths: 60% of the traffic takes direct path from source to destination aggregation blocks, while the remaining transits one additional block, achieving an average block-level path length of 1.4 in our fleet today. OCS also achieves 3x faster fabric reconfiguration compared to pre-evolution Clos fabrics that used a patch panel based interconnect.
BibTeX
@inproceedings{a10.1145/3544216.3544265,
author = {Poutievski, Leon and Mashayekhi, Omid and Ong, Joon and Singh, Arjun and Tariq, Mukarram and Wang, Rui and Zhang, Jianan and Beauregard, Virginia and Conner, Patrick and Gribble, Steve and Kapoor, Rishi and Kratzer, Stephen and Li, Nanfang and Liu, Hong and Nagaraj, Karthik and Ornstein, Jason and Sawhney, Samir and Urata, Ryohei and Vicisano, Lorenzo and Yasumura, Kevin and Zhang, Shidong and Zhou, Junlan and Vahdat, Amin},
title = {Jupiter evolving: transforming google's datacenter network via optical circuit switches and software-defined networking},
year = {2022},
isbn = {9781450394208},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3544216.3544265},
doi = {10.1145/3544216.3544265},
booktitle = {Proceedings of the ACM SIGCOMM 2022 Conference},
pages = {66–85},
numpages = {20},
keywords = {datacenter network, optical circuit switches, software-defined networking, topology engineering, traffic engineering},
series = {SIGCOMM '22}
}
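The abstract's average block-level path length follows directly from its traffic split; a one-line check:

# 60% of traffic takes the 1-hop direct path between aggregation blocks;
# the remaining 40% transits one intermediate block (2 hops).
print(0.60 * 1 + 0.40 * 2)  # 1.4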
SiP-ML: high-bandwidth optical network interconnects for machine learning training
Mehrdad Khani, Manya Ghobadi, Mohammad Alizadeh, Ziyi Zhu, Madeleine Glick, Keren Bergman, Amin Vahdat, Benjamin Klenk, and Eiman Ebrahimi.
Proceedings of the ACM SIGCOMM 2021 Conference,
Virtual Event, USA,
2021.
Abstract
This paper proposes optical network interconnects as a key enabler for building high-bandwidth ML training clusters with strong scaling properties. Our design, called SiP-ML, accelerates the training time of popular DNN models using silicon photonics links capable of providing multiple terabits-per-second of bandwidth per GPU. SiP-ML partitions the training job across GPUs with hybrid data and model parallelism while ensuring the communication pattern can be supported efficiently on the network interconnect. We develop task partitioning and device placement methods that take the degree and reconfiguration latency of optical interconnects into account. Simulations using real DNN models show that, compared to the state-of-the-art electrical networks, our approach improves training time by 1.3–9.1x.
BibTeX
@inproceedings{10.1145/3452296.3472900,
author = {Khani, Mehrdad and Ghobadi, Manya and Alizadeh, Mohammad and Zhu, Ziyi and Glick, Madeleine and Bergman, Keren and Vahdat, Amin and Klenk, Benjamin and Ebrahimi, Eiman},
title = {SiP-ML: high-bandwidth optical network interconnects for machine learning training},
year = {2021},
isbn = {9781450383837},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3452296.3472900},
doi = {10.1145/3452296.3472900},
booktitle = {Proceedings of the 2021 ACM SIGCOMM 2021 Conference},
pages = {657–675},
numpages = {19},
keywords = {distributed machine learning, optical networks, reconfigurable networks, silicon photonics},
series = {SIGCOMM '21}
}
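A back-of-the-envelope view of why terabit-class per-GPU links matter, using the textbook ring all-reduce bound rather than SiP-ML's simulator; the gradient size, GPU count, and bandwidths below are illustrative.

def ring_allreduce_seconds(grad_bytes, n_gpus, link_bps):
    # In a bandwidth-optimal ring, each GPU moves 2*(n-1)/n of the
    # gradient volume over its link.
    bits_on_wire = 2 * (n_gpus - 1) / n_gpus * grad_bytes * 8
    return bits_on_wire / link_bps

for bw in (100e9, 4e12):  # 100 Gbps electrical vs. 4 Tbps photonic
    print(f"{bw:.0e} bps -> {ring_allreduce_seconds(10e9, 64, bw):.3f} s")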
Efficient Direct-Connect Topologies for Collective Communications
Liangyu Zhao, Siddharth Pal, Tapan Chugh, Weiyang Wang, Jason Fantl, Prithwish Basu, Joud Khoury, and Arvind Krishnamurthy.
22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25),
2025.
BibTeX
@inproceedings{a305352,
author = {Zhao, Liangyu and Pal, Siddharth and Chugh, Tapan and Wang, Weiyang and Fantl, Jason and Basu, Prithwish and Khoury, Joud and Krishnamurthy, Arvind},
title = {Efficient {Direct-Connect} Topologies for Collective Communications},
booktitle = {22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25)},
year = {2025},
isbn = {978-1-939133-46-5},
address = {Philadelphia, PA},
pages = {705--737},
url = {https://www.usenix.org/conference/nsdi25/presentation/zhao-liangyu},
publisher = {USENIX Association}
}
TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs
Weiyang Wang, Moein Khazraee, Zhizhen Zhong, Manya Ghobadi, Zhihao Jia, Dheevatsa Mudigere, Ying Zhang, and Anthony Kewitsch.
20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23),
2023.
BibTeX
@inproceedings{a285119,
author = {Wang, Weiyang and Khazraee, Moein and Zhong, Zhizhen and Ghobadi, Manya and Jia, Zhihao and Mudigere, Dheevatsa and Zhang, Ying and Kewitsch, Anthony},
title = {{TopoOpt}: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs},
booktitle = {20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23)},
year = {2023},
isbn = {978-1-939133-33-5},
address = {Boston, MA},
pages = {739--767},
url = {https://www.usenix.org/conference/nsdi23/presentation/wang-weiyang},
publisher = {USENIX Association}
}
LUMION: Fast Fault Recovery for ML Jobs Using Programmable Optical Fabrics
Abhishek Vijaya Kumar, Eric Ding, Arjun Devraj, Darius Bunandar, and Rachee Singh.
arXiv preprint arXiv:2505.23105,
2025.
BibTeX
@article{kumar2025lumion,
title = {LUMION: Fast Fault Recovery for ML Jobs Using Programmable Optical Fabrics},
author = {Kumar, Abhishek Vijaya and Ding, Eric and Devraj, Arjun and Bunandar, Darius and Singh, Rachee},
journal = {arXiv preprint arXiv:2505.23105},
year = {2025},
url = {https://arxiv.org/abs/2505.23105}
}
A case for server-scale photonic connectivity
Abhishek Vijaya Kumar, Arjun Devraj, Darius Bunandar, and Rachee Singh.
Proceedings of the 23rd ACM Workshop on Hot Topics in Networks,
Irvine, CA, USA,
2024.
Abstract
The commoditization of machine learning is fuelling the demand for compute required to both train large models and infer from them. At the same time, scaling the performance of individual microprocessors to satisfy the demand for compute has become increasingly difficult since the end of Moore’s law and Dennard scaling. As a result, compute resources in modern servers are distributed across multiple accelerators on the server board. In this work, we make the case for using optics to interconnect accelerators within a server. A key benefit of on-board chip-to-chip optical connectivity is its ability to dynamically allocate bandwidth between accelerators, where necessary, rather than the common practice of statically dividing bandwidth among links within the topology of a multi-accelerator server, as seen in popular direct-connect architectures. This property prevents bandwidth under-utilization in state-of-the-art rack-scale multi-accelerator deployments. Moreover, server-scale optical connectivity can reduce the blast radius of individual accelerator failures in rack-scale ML deployments. Our early experiments with the prototype of a newly commercialized server-scale photonic interconnect show how the capability of the hardware can enable our vision.
BibTeX
@inproceedings{10.1145/3696348.3696856,
author = {Kumar, Abhishek Vijaya and Devraj, Arjun and Bunandar, Darius and Singh, Rachee},
title = {A case for server-scale photonic connectivity},
year = {2024},
isbn = {9798400712722},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3696348.3696856},
doi = {10.1145/3696348.3696856},
booktitle = {Proceedings of the 23rd ACM Workshop on Hot Topics in Networks},
pages = {290–299},
numpages = {10},
keywords = {Silicon photonics, collective communication, distributed machine learning, optical networks, reconfigurable networks},
series = {HotNets '24}
}
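A toy version of the abstract's dynamic-allocation argument: split a server's total optical bandwidth across accelerator pairs in proportion to instantaneous demand instead of the static equal split of fixed direct-connect links. All names and numbers are invented.

def allocate_bandwidth(total_bps, demands):
    # demands: {(src_accel, dst_accel): relative demand}
    total = sum(demands.values())
    return {pair: total_bps * d / total for pair, d in demands.items()}

print(allocate_bandwidth(3.2e12, {("g0", "g1"): 8, ("g0", "g2"): 1, ("g2", "g3"): 1}))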
PipSwitch: A Circuit Switch Using Programmable Integrated Photonics
Eric Ding and Rachee Singh.
arXiv preprint arXiv:2501.18136,
2025.
BibTeX
@article{ding2025pipswitch,
title = {PipSwitch: A Circuit Switch Using Programmable Integrated Photonics},
author = {Ding, Eric and Singh, Rachee},
journal = {arXiv preprint arXiv:2501.18136},
year = {2025},
url = {https://arxiv.org/abs/2501.18136}
}
TeraPHY: A Chiplet Technology for Low-Power, High-Bandwidth In-Package Optical I/O
Mark Wade, Erik Anderson, Shahab Ardalan, Pavan Bhargava, Sidney Buchbinder, Michael L. Davenport, John Fini, Haiwei Lu, Chen Li, Roy Meade, Chandru Ramamurthy, Michael Rust, Forrest Sedgwick, Vladimir Stojanovic, Derek Van Orden, Chong Zhang, Chen Sun, Sergey Y. Shumarayev, Conor O’Keeffe, Tim T. Hoang, David Kehlet, Ravi V. Mahajan, Matthew T. Guzy, Allen Chan, and Tina Tran.
IEEE Micro,
2020.
BibTeX
@article{a9007742,
author = {Wade, Mark and Anderson, Erik and Ardalan, Shahab and Bhargava, Pavan and Buchbinder, Sidney and Davenport, Michael L. and Fini, John and Lu, Haiwei and Li, Chen and Meade, Roy and Ramamurthy, Chandru and Rust, Michael and Sedgwick, Forrest and Stojanovic, Vladimir and Van Orden, Derek and Zhang, Chong and Sun, Chen and Shumarayev, Sergey Y. and O'Keeffe, Conor and Hoang, Tim T. and Kehlet, David and Mahajan, Ravi V. and Guzy, Matthew T. and Chan, Allen and Tran, Tina},
journal = {IEEE Micro},
title = {TeraPHY: A Chiplet Technology for Low-Power, High-Bandwidth In-Package Optical I/O},
year = {2020},
volume = {40},
number = {2},
pages = {63-71},
keywords = {Optical fibers;Photonics;High-speed optical techniques;Energy efficiency;Bandwidth;Packaging;optical I/O;silicon photonics;FPGA;chiplets;multi-chip package;AIB;EMIB},
doi = {10.1109/MM.2020.2976067}
}
Enhancing Network Management Using Code Generated by Large Language Models
Sathiya Kumaran Mani, Yajie Zhou, Kevin Hsieh, Santiago Segarra, Trevor Eberl, Eliran Azulai, Ido Frizler, Ranveer Chandra, and Srikanth Kandula.
Proceedings of the 22nd ACM Workshop on Hot Topics in Networks,
Cambridge, MA, USA,
2023.
Abstract
Analyzing network topologies and communication graphs is essential in modern network management. However, the lack of a cohesive approach results in a steep learning curve, increased errors, and inefficiencies. In this paper, we present a novel approach that enables natural-language-based network management experiences, leveraging large language models (LLMs) to generate task-specific code from natural language queries. This method addresses the challenges of explainability, scalability, and privacy by allowing network operators to inspect the generated code, removing the need to share network data with LLMs, and focusing on application-specific requests combined with program synthesis techniques. We develop and evaluate a prototype system using benchmark applications, demonstrating high accuracy, cost-effectiveness, and potential for further improvements using complementary program synthesis techniques.
BibTeX
@inproceedings{10.1145/3626111.3628183,
author = {Mani, Sathiya Kumaran and Zhou, Yajie and Hsieh, Kevin and Segarra, Santiago and Eberl, Trevor and Azulai, Eliran and Frizler, Ido and Chandra, Ranveer and Kandula, Srikanth},
title = {Enhancing Network Management Using Code Generated by Large Language Models},
year = {2023},
isbn = {9798400704154},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3626111.3628183},
doi = {10.1145/3626111.3628183},
booktitle = {Proceedings of the 22nd ACM Workshop on Hot Topics in Networks},
pages = {196–204},
numpages = {9},
keywords = {Communication graphs, Graph manipulation, Large language model, Natural language processing, Network lifecycle management, Network management, Program synthesis},
series = {HotNets '23}
}
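A hypothetical example of the kind of task-specific code such a system might generate for the operator query "which hosts lose connectivity to the core if switch s1 fails?". The function name and toy topology are invented; the point is that the generated code runs locally on the operator's graph, so no network data is shared with the LLM.

import networkx as nx

def unreachable_after_failure(g, failed_node, core="core"):
    # Remove the failed element, then report everything cut off from the core.
    h = g.copy()
    h.remove_node(failed_node)
    reachable = nx.node_connected_component(h, core) if core in h else set()
    return set(h.nodes) - reachable

g = nx.Graph([("core", "s1"), ("core", "s2"), ("s1", "h1"), ("s2", "h2")])
print(unreachable_after_failure(g, "s1"))  # {'h1'}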
Learning to Configure Computer Networks with Neural Algorithmic Reasoning
Luca Beurer-Kellner, Martin Vechev, Laurent Vanbever, and Petar Veličković.
Advances in Neural Information Processing Systems,
2022.
BibTeX
@inproceedings{NEURIPS2022_04cc90ec,
author = {Beurer-Kellner, Luca and Vechev, Martin and Vanbever, Laurent and Veli\v{c}kovi\'{c}, Petar},
booktitle = {Advances in Neural Information Processing Systems},
editor = {Koyejo, S. and Mohamed, S. and Agarwal, A. and Belgrave, D. and Cho, K. and Oh, A.},
pages = {730--742},
publisher = {Curran Associates, Inc.},
title = {Learning to Configure Computer Networks with Neural Algorithmic Reasoning},
url = {https://proceedings.neurips.cc/paper_files/paper/2022/file/04cc90ec6868b97b7423dc38ced1e35c-Paper-Conference.pdf},
volume = {35},
year = {2022}
}
What do LLMs need to Synthesize Correct Router Configurations?
Rajdeep Mondal, Alan Tang, Ryan Beckett, Todd Millstein, and George Varghese.
Proceedings of the 22nd ACM Workshop on Hot Topics in Networks,
Cambridge, MA, USA,
2023.
Abstract
We investigate whether Large Language Models (e.g., GPT-4) can synthesize correct router configurations with reduced manual effort. We find GPT-4 works very badly by itself, producing promising draft configurations but with egregious errors in topology, syntax, and semantics. Our strategy, that we call Verified Prompt Programming, is to combine GPT-4 with verifiers, and use localized feedback from the verifier to automatically correct errors. Verification requires a specification and actionable localized feedback to be effective. We show results for two use cases: translating from Cisco to Juniper configurations on a single router, and implementing a no-transit policy on multiple routers. While human input is still required, if we define the leverage as the number of automated prompts to the number of human prompts, our experiments show a leverage of 10X for Juniper translation, and 6X for implementing the no-transit policy, ending with verified configurations.
BibTeX
@inproceedings{10.1145/3626111.3628194,
author = {Mondal, Rajdeep and Tang, Alan and Beckett, Ryan and Millstein, Todd and Varghese, George},
title = {What do LLMs need to Synthesize Correct Router Configurations?},
year = {2023},
isbn = {9798400704154},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3626111.3628194},
doi = {10.1145/3626111.3628194},
booktitle = {Proceedings of the 22nd ACM Workshop on Hot Topics in Networks},
pages = {189–195},
numpages = {7},
keywords = {CoSynth, large language models (LLMs), network verification and synthesis},
series = {HotNets '23}
}
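A skeleton of the verify-and-correct loop the abstract describes, with llm and verifier as hypothetical callables. The two essential ingredients, per the abstract, are a specification and localized, actionable feedback from the verifier.

def synthesize(spec, llm, verifier, max_rounds=20):
    # llm(prompt) -> candidate config; verifier(config) -> list of localized
    # errors such as ["line 12: wrong peer ASN"]; an empty list means verified.
    config, automated_prompts = llm(f"Write a router config for: {spec}"), 0
    for _ in range(max_rounds):
        errors = verifier(config)
        if not errors:
            return config, automated_prompts  # verified configuration
        automated_prompts += 1
        config = llm(f"Fix exactly these errors:\n{errors}\n---\n{config}")
    return None, automated_prompts  # give up and hand back to the human

# The abstract's "leverage" is then the ratio of automated prompts to
# human prompts over the whole session.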
Designing Network Algorithms via Large Language Models
Zhiyuan He, Aashish Gottipati, Lili Qiu, Xufang Luo, Kenuo Xu, Yuqing Yang, and Francis Y. Yan.
Proceedings of the 23rd ACM Workshop on Hot Topics in Networks,
Irvine, CA, USA,
2024.
Abstract
We introduce Nada, the first framework to autonomously design network algorithms by leveraging the generative capabilities of large language models (LLMs). Starting with an existing algorithm implementation, Nada enables LLMs to create a wide variety of alternative designs in the form of code blocks. It then efficiently identifies the top-performing designs through a series of filtering techniques, minimizing the need for full-scale evaluations and significantly reducing computational costs. Using adaptive bitrate (ABR) streaming as a case study, we demonstrate that Nada produces novel ABR algorithms—previously unknown to human developers—that consistently outperform the original algorithm in diverse network environments, including broadband, satellite, 4G, and 5G.
BibTeX
@inproceedings{10.1145/3696348.3696868,
author = {He, Zhiyuan and Gottipati, Aashish and Qiu, Lili and Luo, Xufang and Xu, Kenuo and Yang, Yuqing and Yan, Francis Y.},
title = {Designing Network Algorithms via Large Language Models},
year = {2024},
isbn = {9798400712722},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3696348.3696868},
doi = {10.1145/3696348.3696868},
booktitle = {Proceedings of the 23rd ACM Workshop on Hot Topics in Networks},
pages = {205–212},
numpages = {8},
keywords = {Large Language Models, Network Algorithms},
series = {HotNets '24}
}
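A sketch of the generate, filter, evaluate pipeline from the abstract. llm, compiles, quick_score, and full_eval are hypothetical stubs standing in for Nada's actual components; the point is that cheap filters prune most candidates before the expensive full-scale evaluation.

def design_algorithm(seed_impl, llm, compiles, quick_score, full_eval, k=5):
    # 1) The LLM proposes many alternative designs as code blocks.
    candidates = [llm(f"Propose a variant of:\n{seed_impl}") for _ in range(100)]
    # 2) Cheap filters prune most of them before costly evaluation.
    candidates = [c for c in candidates if compiles(c)]
    candidates = sorted(candidates, key=quick_score, reverse=True)[:k]
    # 3) Only the few survivors get full-scale (e.g., trace-driven) evaluation.
    return max(candidates, key=full_eval)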
NetConfEval: Can LLMs Facilitate Network Configuration?
Changjie Wang, Mariano Scazzariello, Alireza Farshin, Simone Ferlin, Dejan Kostić, and Marco Chiesa.
Proc. ACM Netw.,
2024.
Abstract
This paper explores opportunities to utilize Large Language Models (LLMs) to make network configuration human-friendly, simplifying the configuration of network devices & development of routing algorithms and minimizing errors. We design a set of benchmarks (NetConfEval) to examine the effectiveness of different models in facilitating and automating network configuration. More specifically, we focus on the scenarios where LLMs translate high-level policies, requirements, and descriptions (i.e., specified in natural language) into low-level network configurations & Python code. NetConfEval considers four tasks that could potentially facilitate network configuration, such as (i) generating high-level requirements into a formal specification format, (ii) generating API/function calls from high-level requirements, (iii) developing routing algorithms based on high-level descriptions, and (iv) generating low-level configuration for existing and new protocols based on input documentation. Learning from the results of our study, we propose a set of principles to design LLM-based systems to configure networks. Finally, we present two GPT-4-based prototypes to (i) automatically configure P4-enabled devices from a set of high-level requirements and (ii) integrate LLMs into existing network synthesizers.
BibTeX
@article{10.1145/3656296,
author = {Wang, Changjie and Scazzariello, Mariano and Farshin, Alireza and Ferlin, Simone and Kosti\'{c}, Dejan and Chiesa, Marco},
title = {NetConfEval: Can LLMs Facilitate Network Configuration?},
year = {2024},
issue_date = {June 2024},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {2},
number = {CoNEXT2},
url = {https://doi.org/10.1145/3656296},
doi = {10.1145/3656296},
journal = {Proc. ACM Netw.},
articleno = {7},
numpages = {25},
keywords = {benchmark, code generation, function calling, large language models (llms), network configuration, network synthesizer, p4, rag, routing algorithms}
}
Credence: Augmenting Datacenter Switch Buffer Sharing with ML Predictions
Vamsi Addanki, Maciej Pacut, and Stefan Schmid.
21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24),
2024.
BibTeX
@inproceedings{a295535,
author = {Addanki, Vamsi and Pacut, Maciej and Schmid, Stefan},
title = {Credence: Augmenting Datacenter Switch Buffer Sharing with {ML} Predictions},
booktitle = {21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24)},
year = {2024},
isbn = {978-1-939133-39-7},
address = {Santa Clara, CA},
pages = {613--634},
url = {https://www.usenix.org/conference/nsdi24/presentation/addanki-credence},
publisher = {USENIX Association}
}
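Loosely in the spirit of the title, and emphatically not Credence's algorithm: a toy shared-buffer admission check that behaves like a classic static threshold while queues are short and consults an ML prediction only beyond it. All names and the threshold value are invented.

def admit_packet(queue_len, free_buffer, predicted_drain_pkts, threshold=64):
    if queue_len < threshold:
        return free_buffer > 0            # classic threshold behavior
    # Beyond the threshold, lean on the predictor: admit only if the queue
    # is expected to drain past its current length soon.
    return free_buffer > 0 and predicted_drain_pkts > queue_len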
Improving Online Algorithms via ML Predictions
Manish Purohit, Zoya Svitkina, and Ravi Kumar.
Advances in Neural Information Processing Systems,
2018.
BibTeX
@inproceedings{NEURIPS2018_73a427ba,
author = {Purohit, Manish and Svitkina, Zoya and Kumar, Ravi},
booktitle = {Advances in Neural Information Processing Systems},
editor = {Bengio, S. and Wallach, H. and Larochelle, H. and Grauman, K. and Cesa-Bianchi, N. and Garnett, R.},
publisher = {Curran Associates, Inc.},
title = {Improving Online Algorithms via ML Predictions},
url = {https://proceedings.neurips.cc/paper_files/paper/2018/file/73a427badebe0e32caa2e1fc7530b7f3-Paper.pdf},
volume = {31},
year = {2018}
}
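The paper's flagship example is ski rental with a predicted number of ski days. The sketch below follows the commonly presented form of its algorithm, where lam in (0, 1] trades consistency (cost when the prediction is good) against robustness (worst-case cost when it is not); treat the exact form as a paraphrase rather than a verbatim transcription.

import math

def buy_day(b, pred, lam):
    # b = purchase price (renting costs 1 per day); pred = predicted ski days.
    # If the prediction says we ski at least b days, trust it and buy early.
    return math.ceil(lam * b) if pred >= b else math.ceil(b / lam)

def total_cost(b, true_days, day_bought):
    if true_days < day_bought:
        return true_days                  # rented every day, never bought
    return (day_bought - 1) + b           # rented, then bought

b, lam = 10, 0.5
print(total_cost(b, 30, buy_day(b, pred=25, lam=lam)))  # good prediction: 14
print(total_cost(b, 30, buy_day(b, pred=3, lam=lam)))   # bad prediction: 29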
Towards Integrating Formal Methods into ML-Based Systems for Networking
Fengchen Gong, Divya Raghunathan, Aarti Gupta, and Maria Apostolaki.
Proceedings of the 22nd ACM Workshop on Hot Topics in Networks,
Cambridge, MA, USA,
2023.
Abstract
Owing to its adaptability and scalability, Machine Learning (ML) has gained significant momentum in the networking community. Yet, ML models can still produce outputs that contradict knowledge, i.e., established networking rules and principles. On the other hand, Formal Methods (FM) use rigorous mathematical reasoning based on knowledge, but suffer from the lack of scalability. To capitalize on the complementary strengths of both approaches, we advocate for the integration of knowledge-based FM into ML-based systems for networking problems. Through a case study, we demonstrate the benefits and limitations of using ML models or FM alone. We find that incorporating FM in the training and inference of an ML model yields not only more reliable results but also better performance in various downstream tasks. We hope that our paper inspires a tighter integration of FM-based and ML-based approaches in networking, facilitating the development of more robust and dependable systems.
BibTeX
@inproceedings{10.1145/3626111.3628188,
author = {Gong, Fengchen and Raghunathan, Divya and Gupta, Aarti and Apostolaki, Maria},
title = {Towards Integrating Formal Methods into ML-Based Systems for Networking},
year = {2023},
isbn = {9798400704154},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3626111.3628188},
doi = {10.1145/3626111.3628188},
booktitle = {Proceedings of the 22nd ACM Workshop on Hot Topics in Networks},
pages = {48–55},
numpages = {8},
keywords = {Formal Methods, Imputation, Telemetry, Transformer},
series = {HotNets '23}
}
Zoom2Net: Constrained Network Telemetry Imputation
Fengchen Gong, Divya Raghunathan, Aarti Gupta, and Maria Apostolaki.
Proceedings of the ACM SIGCOMM 2024 Conference,
Sydney, NSW, Australia,
2024.
Abstract
Fine-grained monitoring is crucial for multiple data-driven tasks such as debugging, provisioning, and securing networks. Yet, practical constraints in collecting, extracting, and storing data often force operators to use coarse-grained sampled monitoring, degrading the performance of the various tasks. In this work, we explore the feasibility of leveraging the correlations among coarse-grained time series to impute their fine-grained counterparts in software. We present Zoom2Net, a transformer-based model for network imputation that incorporates domain knowledge through operational and measurement constraints, ensuring that the imputed network telemetry time series are not only realistic but align with existing measurements. This approach enhances the capabilities of current monitoring infrastructures, allowing operators to gain more insights into system behaviors without the need for hardware upgrades. We evaluate Zoom2Net on four diverse datasets (e.g., cloud telemetry and Internet data transfer) and use cases (e.g., bursts analysis and traffic classification). We demonstrate that Zoom2Net consistently achieves high imputation accuracy with a zoom-in factor of up to 100 and performs better on downstream tasks compared to baselines by an average of 38%.
BibTeX
@inproceedings{10.1145/3651890.3672225,
author = {Gong, Fengchen and Raghunathan, Divya and Gupta, Aarti and Apostolaki, Maria},
title = {Zoom2Net: Constrained Network Telemetry Imputation},
year = {2024},
isbn = {9798400706141},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3651890.3672225},
doi = {10.1145/3651890.3672225},
booktitle = {Proceedings of the ACM SIGCOMM 2024 Conference},
pages = {764–777},
numpages = {14},
keywords = {telemetry, imputation, formal methods, transformer},
series = {ACM SIGCOMM '24}
}
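A minimal instance of the measurement-constraint idea in the abstract: whatever the model imputes, project it back so the fine-grained series aggregates exactly to the coarse sample the operator actually measured. This assumes the coarse counter is a sum; Zoom2Net's constraint set is richer.

def enforce_measurement(imputed, coarse_sum):
    # Rescale fine-grained guesses so that sum(fine) == coarse measurement.
    s = sum(imputed)
    if s == 0:
        return [coarse_sum / len(imputed)] * len(imputed)
    return [v * coarse_sum / s for v in imputed]

print(enforce_measurement([2.0, 6.0, 2.0], coarse_sum=20.0))  # [4.0, 12.0, 4.0]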
DOTE: Rethinking (Predictive) WAN Traffic Engineering
Yarin Perry, Felipe Vieira Frujeri, Chaim Hoch, Srikanth Kandula, Ishai Menache, Michael Schapira, and Aviv Tamar.
20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23),
2023.
BibTeX
@inproceedings{a286421,
author = {Perry, Yarin and Frujeri, Felipe Vieira and Hoch, Chaim and Kandula, Srikanth and Menache, Ishai and Schapira, Michael and Tamar, Aviv},
title = {{DOTE}: Rethinking (Predictive) {WAN} Traffic Engineering},
booktitle = {20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23)},
year = {2023},
isbn = {978-1-939133-33-5},
address = {Boston, MA},
pages = {1557--1581},
url = {https://www.usenix.org/conference/nsdi23/presentation/perry},
publisher = {USENIX Association}
}
RedTE: Mitigating Subsecond Traffic Bursts with Real-time and Distributed Traffic Engineering
Fei Gui, Songtao Wang, Dan Li, Li Chen, Kaihui Gao, Congcong Min, and Yi Wang.
Proceedings of the ACM SIGCOMM 2024 Conference,
Sydney, NSW, Australia,
2024.
Abstract
Internet traffic bursts usually happen within a second, thus conventional burst mitigation methods ignore the potential of Traffic Engineering (TE). However, our experiments indicate that a TE system, with a sub-second control loop latency, can effectively alleviate burst-induced congestion. TE-based methods can leverage network-wide tunnel-level information to make globally informed decisions (e.g., balancing traffic bursts among multiple paths). Our insight in reducing control loop latency is to let each router make local TE decisions, but this introduces the key challenge of minimizing performance loss compared to centralized TE systems. In this paper, we present RedTE, a novel distributed TE system with a control loop latency of < 100ms, while achieving performance comparable to centralized TE systems. RedTE’s innovation is the modeling of TE as a distributed cooperative multi-agent problem, and we design a novel multi-agent deep reinforcement learning algorithm to solve it, which enables each agent to make globally informed decisions solely based on local information. We implement real RedTE routers and deploy them on a WAN spanning six city datacenters. Evaluation reveals notable improvements compared to existing solutions: < 100ms of control loop latency, a 37.4% reduction in maximum link utilization, and a 78.9% reduction in average queue length.
BibTeX
@inproceedings{10.1145/3651890.3672231,
author = {Gui, Fei and Wang, Songtao and Li, Dan and Chen, Li and Gao, Kaihui and Min, Congcong and Wang, Yi},
title = {RedTE: Mitigating Subsecond Traffic Bursts with Real-time and Distributed Traffic Engineering},
year = {2024},
isbn = {9798400706141},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3651890.3672231},
doi = {10.1145/3651890.3672231},
booktitle = {Proceedings of the ACM SIGCOMM 2024 Conference},
pages = {71–85},
numpages = {15},
keywords = {traffic engineering, network optimization, machine learning},
series = {ACM SIGCOMM '24}
}
Transferable Neural WAN TE for Changing Topologies
Abd AlRhman AlQiam, Yuanjun Yao, Zhaodong Wang, Satyajeet Singh Ahuja, Ying Zhang, Sanjay G. Rao, Bruno Ribeiro, and Mohit Tawarmalani.
Proceedings of the ACM SIGCOMM 2024 Conference,
Sydney, NSW, Australia,
2024.
Abstract
Recently, researchers have proposed ML-driven traffic engineering (TE) schemes where a neural network model is used to produce TE decisions in lieu of conventional optimization solvers. Unfortunately, existing ML-based TE schemes are not explicitly designed to be robust to topology changes that may occur due to WAN evolution, failures or planned maintenance. In this paper, we present HARP, a neural model for TE explicitly capable of handling variations in topology including those not observed in training. HARP is designed with two principles in mind: (i) ensure invariances to natural input transformations (e.g., permutations of node ids, tunnel reordering); and (ii) align neural architecture to the optimization model. Evaluations on a multi-week dataset of a large private WAN show HARP achieves an MLU at most 11% higher than optimal over 98% of the time despite encountering significantly different topologies in testing relative to training data. Further, comparisons with state-of-the-art ML-based TE schemes indicate the importance of the mechanisms introduced by HARP to handle topology variability. Finally, when predicted traffic matrices are provided, HARP outperforms classic optimization solvers achieving a median reduction in MLU of 5 to 10% on the true traffic matrix.
BibTeX
@inproceedings{10.1145/3651890.3672237,
author = {AlQiam, Abd AlRhman and Yao, Yuanjun and Wang, Zhaodong and Ahuja, Satyajeet Singh and Zhang, Ying and Rao, Sanjay G. and Ribeiro, Bruno and Tawarmalani, Mohit},
title = {Transferable Neural WAN TE for Changing Topologies},
year = {2024},
isbn = {9798400706141},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3651890.3672237},
doi = {10.1145/3651890.3672237},
booktitle = {Proceedings of the ACM SIGCOMM 2024 Conference},
pages = {86–102},
numpages = {17},
keywords = {traffic engineering, wide-area networks, network optimization, machine learning},
series = {ACM SIGCOMM '24}
}
FIGRET: Fine-Grained Robustness-Enhanced Traffic Engineering
Ximeng Liu, Shizhen Zhao, Yong Cui, and Xinbing Wang.
Proceedings of the ACM SIGCOMM 2024 Conference,
Sydney, NSW, Australia,
2024.
Abstract
Traffic Engineering (TE) is critical for improving network performance and reliability. A key challenge in TE is the management of sudden traffic bursts. Existing TE schemes either do not handle traffic bursts or uniformly guard against traffic bursts, thereby facing difficulties in achieving a balance between normal-case performance and burst-case performance. To address this issue, we introduce FIGRET, a Fine-Grained Robustness-Enhanced TE scheme. FIGRET offers a novel approach to TE by providing varying levels of robustness enhancements, customized according to the distinct traffic characteristics of various source-destination pairs. By leveraging a burst-aware loss function and deep learning techniques, FIGRET is capable of generating high-quality TE solutions efficiently. Our evaluations of real-world production networks, including Wide Area Networks and data centers, demonstrate that FIGRET significantly outperforms existing TE schemes. Compared to the TE scheme currently deployed in Google’s Jupiter data center networks, FIGRET achieves a 9%-34% reduction in average Maximum Link Utilization and improves solution speed by 35×–1800×. Against DOTE, a state-of-the-art deep learning-based TE method, FIGRET substantially lowers the occurrence of significant congestion events triggered by traffic bursts by 41%-53.9% in topologies with high traffic dynamics.
BibTeX
@inproceedings{10.1145/3651890.3672258,
author = {Liu, Ximeng and Zhao, Shizhen and Cui, Yong and Wang, Xinbing},
title = {FIGRET: Fine-Grained Robustness-Enhanced Traffic Engineering},
year = {2024},
isbn = {9798400706141},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3651890.3672258},
doi = {10.1145/3651890.3672258},
booktitle = {Proceedings of the ACM SIGCOMM 2024 Conference},
pages = {117–135},
numpages = {19},
keywords = {traffic engineering, wide-area networks, datacenter networks, machine learning},
series = {ACM SIGCOMM '24}
}
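A schematic of a burst-aware training loss in the spirit of the abstract, not FIGRET's exact formulation: normal-case max link utilization plus a penalty that charges headroom only on source-destination pairs that are actually bursty, rather than guarding uniformly.

def burst_aware_loss(mlu, pair_util, pair_burstiness, alpha=0.1):
    # pair_util / pair_burstiness: {(src, dst): value}. A uniform guard
    # would use one burstiness weight for every pair instead.
    penalty = sum(pair_burstiness[p] * u for p, u in pair_util.items())
    return mlu + alpha * penalty

print(burst_aware_loss(0.7, {("a", "b"): 0.5, ("a", "c"): 0.2},
                       {("a", "b"): 2.0, ("a", "c"): 0.1}))  # 0.802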
TCP ex machina: computer-generated congestion control
Keith Winstein and Hari Balakrishnan.
Proceedings of the ACM SIGCOMM 2013 Conference on SIGCOMM,
Hong Kong, China,
2013.
Abstract
This paper describes a new approach to end-to-end congestion control on a multi-user network. Rather than manually formulate each endpoint’s reaction to congestion signals, as in traditional protocols, we developed a program called Remy that generates congestion-control algorithms to run at the endpoints. In this approach, the protocol designer specifies their prior knowledge or assumptions about the network and an objective that the algorithm will try to achieve, e.g., high throughput and low queueing delay. Remy then produces a distributed algorithm—the control rules for the independent endpoints—that tries to achieve this objective. In simulations with ns-2, Remy-generated algorithms outperformed human-designed end-to-end techniques, including TCP Cubic, Compound, and Vegas. In many cases, Remy’s algorithms also outperformed methods that require intrusive in-network changes, including XCP and Cubic-over-sfqCoDel (stochastic fair queueing with CoDel for active queue management). Remy can generate algorithms both for networks where some parameters are known tightly a priori, e.g. datacenters, and for networks where prior knowledge is less precise, such as cellular networks. We characterize the sensitivity of the resulting performance to the specificity of the prior knowledge, and the consequences when real-world conditions contradict the assumptions supplied at design-time.
BibTeX
@inproceedings{10.1145/2486001.2486020,
author = {Winstein, Keith and Balakrishnan, Hari},
title = {TCP ex machina: computer-generated congestion control},
year = {2013},
isbn = {9781450320566},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/2486001.2486020},
doi = {10.1145/2486001.2486020},
booktitle = {Proceedings of the ACM SIGCOMM 2013 Conference on SIGCOMM},
pages = {123–134},
numpages = {12},
keywords = {computer-designed algorithms, congestion control},
series = {SIGCOMM '13}
}
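Remy's searchable object is a rule table mapping a small congestion-signal state (EWMAs of ACK and send interarrival times plus an RTT ratio) to sender actions. The toy table below only shows the shape of such a computer-generated controller; the buckets and action values are invented, not Remy's output.

# Action: (window multiplier m, window increment b, pacing gap in ms).
RULES = {
    (0, 0, 0): (2.0, 4, 0.5),   # quiet network: ramp up
    (1, 0, 0): (1.0, 2, 1.0),
    (1, 1, 1): (0.5, 0, 3.0),   # delay rising: back off
}

def action(ack_ewma_ms, send_ewma_ms, rtt_ratio):
    # Coarsen the signals into buckets, then look up the action.
    state = (int(ack_ewma_ms >= 50), int(send_ewma_ms >= 50), int(rtt_ratio >= 1.5))
    return RULES.get(state, (1.0, 1, 1.0))

print(action(80.0, 20.0, 1.1))  # (1.0, 2, 1.0)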
Mowgli: Passively Learned Rate Control for Real-Time Video
Neil Agarwal, Rui Pan, Francis Y. Yan, and Ravi Netravali.
22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25),
2025.
BibTeX
Click
here to close the dropdown!
@inproceedings{a306039,
author = {Agarwal, Neil and Pan, Rui and Yan, Francis Y. and Netravali, Ravi},
title = {Mowgli: Passively Learned Rate Control for {Real-Time} Video},
booktitle = {22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25)},
year = {2025},
isbn = {978-1-939133-46-5},
address = {Philadelphia, PA},
pages = {579--594},
url = {https://www.usenix.org/conference/nsdi25/presentation/agarwal},
publisher = {USENIX Association}
}
PCC: Re-architecting congestion control for consistent high performance
Mo Dong, Qingxi Li, Doron Zarchy, P. Brighten Godfrey, and Michael Schapira.
12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15),
2015.
BibTeX
@inproceedings{dong2015pcc,
author = {Dong, Mo and Li, Qingxi and Zarchy, Doron and Godfrey, P. Brighten and Schapira, Michael},
title = {{PCC}: Re-architecting congestion control for consistent high performance},
booktitle = {12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15)},
year = {2015},
address = {Oakland, CA},
pages = {395--408},
url = {https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-dong.pdf},
publisher = {USENIX Association}
}
PCC Vivace: Online-Learning Congestion Control
Mo Dong, Tong Meng, Doron Zarchy, Engin Arslan, Yossi Gilad, Brighten Godfrey, and Michael Schapira.
15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18),
2018.
BibTeX
@inproceedings{a211245,
author = {Dong, Mo and Meng, Tong and Zarchy, Doron and Arslan, Engin and Gilad, Yossi and Godfrey, Brighten and Schapira, Michael},
title = {{PCC} Vivace: {Online-Learning} Congestion Control},
booktitle = {15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18)},
year = {2018},
isbn = {978-1-939133-01-4},
address = {Renton, WA},
pages = {343--356},
url = {https://www.usenix.org/conference/nsdi18/presentation/dong},
publisher = {USENIX Association}
}
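Vivace picks sending rates by online gradient ascent on an empirically measured utility of throughput, latency gradient, and loss. The sketch below follows that general shape; the exponent and penalty coefficients are illustrative rather than the paper's tuned values, and measure is a hypothetical probe returning what was observed at a given rate.

def utility(rate, rtt_gradient, loss, b=900.0, c=11.35):
    # Sublinear throughput reward; penalties for rising RTT and for loss.
    return rate ** 0.9 - b * rate * max(rtt_gradient, 0.0) - c * rate * loss

def next_rate(rate, measure, eps=0.05, step=1e-3):
    # Probe slightly above and below the current rate, estimate the utility
    # gradient empirically, and move the rate uphill.
    hi, lo = rate * (1 + eps), rate * (1 - eps)
    u_hi, u_lo = utility(hi, *measure(hi)), utility(lo, *measure(lo))
    grad = (u_hi - u_lo) / (hi - lo)
    return max(rate + step * grad, 1.0)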