Mathematics of Data Science - 554.747 (Fall 2025)

This course revolves around a single question: how can we efficiently extract information from high-dimensional, noisy data with limited samples? This proof-driven graduate class develops the mathematical foundations behind modern data science, drawing on high-dimensional probability, optimization theory, and statistical learning. Core topics include concentration inequalities; dimension reduction; clustering; structure-exploiting inference; generalization via uniform convergence and algorithmic stability; and information-theoretic limits (Le Cam, Fano).

Coordinates

Time: TTh 9:00PM - 10:15PM
Location: Hodson 216

Personnel

Instructor:
Mateo Díaz (mateodd at jhu ~~dot~~ edu)
OH Th 4:00PM - 5:30PM Wyman S429

Teaching Assistants:
Pedro Izquierdo Lehmann (pizquie1 at jhu ~~dot~~ edu)
OH M 9:30AM - 10:15AM Wyman S425

Ian McPherson (imcpher1 at jhu ~~dot~~ edu)
OH F 9:00AM - 10:30AM Wyman S425

Lecture notes

Handwritten lecture notes will be posted here.

Textbook

We will use the following references:

(Main reference) Roman Vershynin, High-Dimensional Probability: An Introduction with Applications in Data Science, 2nd Edition. Cambridge University Press (2025).
Martin J. Wainwright, High-Dimensional Statistics: A Non-Asymptotic Viewpoint, 1st Edition. Cambridge University Press (2019).
Francis Bach, Learning Theory from First Principles, 1st Edition. MIT Press (2024).
Shai Shalev-Shwartz, Shai Ben-David, Understanding Machine Learning: From Theory to Algorithms, 1st Edition. Cambridge University Press (2014).
Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition. Springer (2009).
Terence Tao, Topics in Random Matrix Theory, 1st Edition. American Mathematical Society (2012).
Yuxin Chen, Yuejie Chi, Jianqing Fan, Cong Ma, Spectral Methods for Data Science: A Statistical Perspective, 1st Edition. Now Foundations and Trends (2021).

Grading system

Your grade will take into account four components: Homework (50%), Take-home exam (20%), Final project (20%), and Participation (10%). In what follows, we elaborate on each of these components.
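As a rough illustration of how these weights combine, here is a minimal Python sketch. It assumes each component is scored on a 0-100 scale and that the final grade is a plain weighted average; the syllabus does not say how component scores are computed or curved, so treat this as an assumption. The names `WEIGHTS` and `final_grade`, and the example scores, are hypothetical.

```python
# Component weights as listed in the grading breakdown; they sum to 1.
WEIGHTS = {
    "homework": 0.50,
    "take_home_exam": 0.20,
    "final_project": 0.20,
    "participation": 0.10,
}


def final_grade(scores):
    """Weighted average of component scores.

    Assumption (not stated in the syllabus): every score is on a 0-100 scale.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores, for illustration only.
example = {
    "homework": 90,
    "take_home_exam": 80,
    "final_project": 85,
    "participation": 100,
}
print(final_grade(example))
```

Since homework carries half of the weight, a 10-point change in the homework average moves the total by 5 points under this assumption.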

Homework

Problem sets (approximately five) will be posted here and on the course Canvas. Some homework assignments will include at least one question that involves writing and testing code; Python is preferred. Please submit homework assignments on Gradescope.

General policies. Your solutions must be written legibly and intelligibly in clear English. Use complete sentences. Points may be taken off for disorganized or illegible work. Collaboration is welcome, but your writeup must be your own—do not copy answers from somebody else. Indicate at the top of your homework who you collaborated with on the assignment. If you believe your homework or exam grade to be in error, submit a regrade request. Every student will be allowed one late submission, up to 24 hours after the due date (no questions asked); just let us know that you will use your one shot.

Large language models policy. The use of large language models (LLMs) for brainstorming, coding assistance, and text polishing is permitted. What is not permitted is blindly answering your assignments or the midterm with LLMs. We reserve the right to run your submissions through https://gptzero.me/; if it reports a probability greater than 90% that a submission is AI-generated, you may be required to take an oral examination to defend your work. Please begin each assignment with a short note explaining how, if at all, you used LLMs.

Midterm

There will be one take-home exam with a date TBA. The exam will be posted on Canvas, and you will have two days to turn in your solutions through Gradescope. You may not discuss the exam with anyone or seek external help.

Final project

There will be a final project, which gives students the opportunity to explore topics related to the course that were not covered in class. Students may work individually or in groups of up to four, and the sole deliverable is a written report. Suggested topics will be released two weeks before the end of class, though students are welcome to propose their own with instructor approval. Reports should be written in the style of lecture notes if the topic is well established, or as a short research paper if the topic concerns current work. In either case, the report should state one main result or idea, explain why it is interesting and what implications it has (including potential new research directions), and present its proof—or a sketch thereof if the full argument is too long.

Participation

Participation accounts for 10% of the final grade. Engaging in class, on Piazza, and in office hours counts toward participation. This includes asking questions (even if you think they are silly!) and pointing out typos or mistakes.