CS 511: Advanced Data Management
Time: 2:00 PM - 3:15 PM (Tue/Thu)
Location: 103 Transportation Building (Google Map)
See this course in [Canvas]
[Course Explorer]
[Canvas SignUp Request]
Instructor: Yongjoo Park (https://yongjoopark.com):
- Email: yongjoo@g.illinois.edu
- Office Hours: After class in person at Rm 2114 Siebel for CS
TA 1: Hongtai Cao (hongtai2@illinois.edu)
- Office Hours: Wednesday 10-11 AM in person
TA 2: Talika Gupta (talikag2@illinois.edu)
- Office Hours: Friday 2-3 PM in person
Prerequisites
- Background: CS411 (Database) or equivalent.
- Programming: Proficiency in one of the following programming languages: C, Java, etc. We don't provide programming practices; you should be able to write code independently.
Class Structure
- Overview by Instructor: 5 mins
- Student Presentation (if any): 25 mins (including Q&A)
- Instructor Presentation: 40 mins
- Quiz: 5 mins
Your main workload as a student will be
- Understanding papers (take a quiz every class and two midterms)
- Participating in group projects (3 MPs and 1 research)
- (Optional) presenting a paper for 5% extra credits
-
[Slides in Google Drive] Accessible to all Illinois people; Permission denied? See this page.
Topic I: Analytical Database
Week 1: Distributed Processing I
8/27: Course Overview / ComesAround by Stonebraker and Hellerstein
- Main question: Why relational model?
8/29: MapReduce by Dean and Ghemawat
- Main question: How can we scale analytic operations?
- (read more: ParallelDB by DeWitt and Gray)
- Presentation sign-up (1/2): 8/30 8 AM, Canvas announcement
Week 2: Distributed Processing II
9/3: Distributed Join by Blanas et al.
- Main question: What join algorithms for distributed DBMSs?
- Project group sign-up starts (for both MPs and Research project)
9/5: Google File System by Ghemawat et al.
- Main question: What file sysetems for distributed DBMSs?
Week 3: Columnar Layout
9/10: C-Store by Stonebraker et al.
- Main question: Different data layouts for analytical workloads?
- Student presentation 1 (Early-bird bonus: 20%)
9/12: Dremel
- Main question: How can we represent nested data?
- Student presentation 2 (Early-bird bonus: 20%)
- Project 1 (MP: Spark/HDFS) Out (due in three weeks: 10/2)
Week 4: Sketching
9/17: Monkey by Dayan and Idreos
- Main question: Basics and optimization of Bloom filters
- Student presentation 3 (Early-bird bonus: 15%)
9/19: HyperLogLog (Read more: HLL++ by Google)
- Main question: Memory-efficient counting of unique items
- Student presentation 4 (Early-bird bonus: 15%)
- Project 4 (research) Proposal Out (Due 11/7)
Week 5: Sampling
9/24: BlinkDB
- Student presentation 5 (Early-bird bonus: 10%)
9/26: Online Aggregation by Hellerstein (Read more: DeepOLA by Park)
- Main question: New query processing paradigm
- Student presentation 6 (Early-bird bonus: 10%)
- Presentation sign-up (2/2): 9/28 8 AM, Canvas announcement
Week 6: Cardinality Estimation
10/1: SystemR at IBM (Read more: QuickSel by Park)
- Main question: Basics of cardinality/selectivity estimation
- Student presentation 7 (Early-bird bonus: 5%)
10/3: Naru by Yang and Stoica
- Main question: Deep learning for cardinality estimation
- Student presentation 8 (Early-bird bonus: 5%)
- Project 1 Due by 10/2 11:59 PM Central
- Project 2 (MP: DeepOLA) Out (due in three weeks: 10/23)
- project topic: Deep Online Aggregation
Week 7: Compression
10/8: Columnar Compression by Abadi and Madden
- Main question: Data compression techniques for DBMSs
- Student presentation 9
10/10: Midterm 1 (in-person; Scope: Week 1-7)
- Online option available on 10/12 (Saturday) at 2 PM Central; 10% point reduction.
- To take the online version, you must request it by the end of 10/3 via email to the instructor Yongjoo Park OR you must have a very good reason (e.g., unforeseeable tragic events).
Topic II: Data Science
Week 8: Dataframe
10/15: Modin by Petersohn and Parameswaran
- Main question: Scalable library for dataframe operations
- Student presentation 10
10/17: Ray by Moritz and Stoica
- Main question: Distributed framework behind Modin and ChatGPT
- Student presentation 11
Week 9: Scalable Data Science
10/22: ElasticNotebook by Li and Park
- Main question: Can we save Jupyter notebooks?
- Student presentation 12
10/24: Differential Privacy
- Main question: How can we share sensitive information without breaching privacy?
- Student presentation 13
- Project 2 Due by 10/23 11:59 PM Central
- Project 3 (MP: Ray/Pandas) Out (due in three weeks: 11/13)
Topic III: Transaction and System Tuning
Week 10: Database Transaction
10/29: Snapshot Isolation
- Main question: Snapshot isolation vs Serializability
- Student presentation 14
10/31: Aria by Lu and Madden
- Main question: How does deterministic transactions work?
- Student presentation 15
Week 11: Learning for DBMS
11/5: Learned Index by Kraska
- Main question: Machine learning for indexing
- (Read more: AirIndex by Chockchowwat and Park)
- Student presentation 16
11/7: OtterTune by Aken and Pavlo
- Main question: ML for System Optimization
- Student presentation 17
- Project 4 (research) Proposal Due
Topic IV: No SQL
Week 12: Key-Value Store
11/12: LSM-Tree Slides by Dayan
- Main question: Basics of LSM tree: the data structure behind LevelDB/RocksDB
- Student presentation 18
11/14: PebblesDB
- Main question: Understand write amplification in LSM Tree
- Student presentation 19
- Project 3 Due by 11/13 11:59 PM Central
Week 13: Stream Processing
11/19: Kafka -> Confluent
- Main question: Most well-known stream procsesing system
- Student presentation 20
11/21: Naiad: Timely Dataflow -> Materialize
- Main question: A new alternative to stream processing
- Student presentation 21
Fall Break
11/26: Fall break
11/28: Fall break
Week 14: Final Presentation and Midterm 2
12/3: Project Presentation I
- Project 4 (report; 4 pages) Due by 12/2 11:59 PM Central
- Up to 12 groups will present this day; 5 mins per group
12/5: Project Presentation II
- The remaining groups will present this day; 5 mins per group
12/10: Midterm 2 (In-person; Scope: Week 8-13) (Last day of instruction)
- Online option available on 12/12 (Saturday) at 2 PM Central; 10% point reduction.
- To take the online version, you must request it by the end of 12/3 via email to the instructor Yongjoo Park OR you must have a very good reason (e.g., unforeseeable tragic events).
-
Overview
Regular Components
- Evaluating Other Student Presentations: 5%
- Quizzes: 11%
- Midterms: 40% (20% + 20%)
- Group Projects: 40%
- Participation: 4%
Extra Credits
- Student Presentation: 2% (min) - 5% (max); evaluation based on peers
Evaluating Other Student Presentations (5%)
There are 21 student presentation slots. For each presentation, you should provide peer evaluation using a Likert scale between 1 and 5. To receive full credit (5%), you must submit evaluations for at least 50% of the presentations (e.g., 50 if there are 100 presentations in total). If you submit less, you will receive partial credit (e.g., if you submit 25 evaluations among 100 total presentations, you will receive 25/50*5% = 2.5%).
Warning: If you submit constant scores (e.g., 5/5 for every presentation), it will be considered as “no submission” and your Participation score may get penalized.
[Presentation Evaluation Form]
Quizzes (11%)
In the middle of each class, there will be a short quiz. You will receive a full 11% by scoring 70/100 (or 70%) in the quiz portion. Specifically, to calculate the final score, your individual quiz scores will be summed and normalized to have a max 100 points (which is called a raw score); then, the final score will be calculated as:
(final quiz score; max 100) = min(raw score / 70 * 100, 100)
For regular absences, no make-up quizzes will be provided (so come to class!). For Verified Absences, we will provide make-up quizzes. Sample quizzes will be provided in the first and second classes.
“70/100” is to accommodate unexpected absences (e.g., mild cold, too busy for other courses, family travel). No additional absences will be allowed except for extreme circumstances (e.g., seriously ill, contagious disease, tragic events); in these cases, please contact the main instructor directly to discuss and/or adjust your course workload.
Quiz Evaluation: You can receive up to two points in each quiz. If you submit an answer (in class), you will get one point always (even though your answer is wrong). If your answer is correct, you will receive one additional point, thereby two points in total for that quiz.
Quiz scores will be made available immediately.
Midterms (40%: 20% + 20%)
We have two midterms: the first on 10/10 and the second on 12/10. Exam questions are based on the quizzes. Online make-up exams are offered for important events (e.g., family travel) two days after each exam day with a 10% relative point reduction (thus, perfect answers will receive 90 out of 100, if the total is 100). The point reduction may be removed if the reason is something serious and unavoidable (e.g., tragic events, serious illness).
Your midterm scores will be released within one week. This delay is to accomodate the students who may be taking online exams.
Projects (40%): three MPs and one research project
Four students will form a group for the project. The goal of the projects is to learn well-known (modern) systems in the area of large-scale data management. Please read the following document to learn more about what tasks will be accomplished.
Late submissions will receive a 20% point reduction every day. For example, if you submit within the first day after the deadline, you will receive up to 80/100. After five days, you will receive 0 points. We will check both report submission times (if required) and the last timestamps of code changes (e.g., on GitHub).
Participation (4%)
Your in-class participation will be evaluated based on your participation in questions/discussions. If you ask questions or participate in discussions either in class or online, claim your contribution using this Google Form within two days of the class. You should not falsely claim your contribution (e.g., by submitting forms without actually asking questions), which will be in violation of the code.
Extra Credit: Student Presentation (5%)
Presentation Scores
You will receive extra credit based on the peer evaluations by other students. To ensure most students receive high-enough scores, the following (rather complex) scheme will be used.
(your presentation score; max 100) = \sum_i^N (normalized score from student-i) / N
where (normalized score from student-i) is obtained by normalizing all the scores submitted by student-i to have mean 93 and standard deviation 5. That is, even if student-i submits relatively bad scores (say 2/5 on average), his/her negative scores do not affect your overall score since those scores will be anyway normalized to have the same mean and standard deviation. The same logic applies to another student (say student-j) who tends to submit higher scores (say 4/5 on average) since his/her scores will also be normalized to have the same mean and standard deviation.
Your presentation scores will only be released almost at end of the semester due to normalization process.
Early-bird Bonus
To encourage presenting papers we cover earlier weeks of this course, we will give an additional bonus by scaling up your “presentation score” (up to 20%). That is, if you present in the second week (Lecture No 4), scaling factor 1.2 will be applied to your “presentation score”, making your final score “presentation score” * 1.2. The scaling ratio decreases over time. For more details, please see this Presentation Schedule (to view this document, you must be logged in using an Illinois account).
-
Academic Integrity Policy and Procedure
Anti Racism and Inclusivity Statement
Accessibility and Accommodations
Sexual Misconduct Reporting Obligations
Statement on CS CARES and CS Values and Code of Conduct
All members of the Illinois Computer Science department - faculty, staff, and students - are expected to adhere to the CS Values and Code of Conduct. The CS CARES Committee is available to serve as a resource to help people who are concerned about or experience a potential violation of the Code. If you experience such issues, please contact the CS CARES Committee. The instructors of this course are also available for issues related to this class.