CS4225

Course Title

Big Data Systems for Data Science

Grade

A-

Semester

AY23/24 S2

Review

This course is all about Big Data and the tools we can use to deal with it. The first part throws a lot of new terminology and definitions at us, like how to measure the performance of a data pipeline or how much storage we need. This leads up to the MapReduce model, which makes up a pretty substantial part of the course. MapReduce is like a pipeline: a map function turns each input record into key-value pairs, the framework groups those pairs by key, and a reduce function combines the values for each key. This is a very powerful model because the map and reduce steps can run in parallel across many machines, which is very important when dealing with large datasets. Questions they may ask are about designing the right map and reduce functions for a given problem, and how to ensure that the data is processed correctly. IIRC this whole part was the Hadoop part, which is the more traditional approach to dealing with big data.
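To make the map, group and reduce steps concrete, here is a minimal single-machine sketch of a word count in plain Python. It is not Hadoop code and not taken from the course material; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are made up for illustration, and a real framework would run the phases across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key (the framework's job)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data big systems", "data science"]
    print(word_count(sample))
```

The design point is that the mapper only ever sees one record and the reducer only ever sees one key, which is what lets a framework split the work across workers.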

The more modern one would be Spark, which hides much of the low-level job management that the Hadoop approach makes you deal with. The code becomes much simpler and more human-readable: it reads as a sequence of steps, so you can chain together operations to get the result you want. The Spark part is also where real-time data processing comes in, along with how synchronisation issues between the different worker nodes are managed.
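The chaining style can be sketched without installing Spark. The `ToyDataset` class below is a hypothetical stand-in written for this review, not Spark's API; its method names only loosely mirror the kind of operations Spark offers, and it runs eagerly on one machine, whereas Spark evaluates transformations lazily across a cluster.

```python
class ToyDataset:
    """Hypothetical in-memory stand-in for a distributed dataset."""

    def __init__(self, items):
        self._items = list(items)

    def map(self, fn):
        # Apply fn to every element and return a new dataset.
        return ToyDataset(fn(x) for x in self._items)

    def filter(self, predicate):
        # Keep only the elements for which predicate is true.
        return ToyDataset(x for x in self._items if predicate(x))

    def reduce_by_key(self, fn):
        # Merge the values of (key, value) pairs that share a key.
        merged = {}
        for key, value in self._items:
            merged[key] = fn(merged[key], value) if key in merged else value
        return ToyDataset(merged.items())

    def collect(self):
        # Return the contents as an ordinary list.
        return list(self._items)


def long_word_counts(words, min_len=4):
    """Count words of at least min_len characters using one chain."""
    return dict(
        ToyDataset(words)
        .filter(lambda w: len(w) >= min_len)
        .map(lambda w: (w, 1))
        .reduce_by_key(lambda a, b: a + b)
        .collect()
    )


if __name__ == "__main__":
    print(long_word_counts(["big", "data", "data", "spark"]))
```

Compared with writing separate map and reduce functions, the whole job here is one readable expression, which is roughly the appeal described above.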

The end section deals with how we can use this data to come up with insights. Concepts like machine learning, PageRank and graph algorithms will be thrown about, but now in a new light: we design our big data systems to support these algorithms. I quite liked this part as it's the more problem-solving part.
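As an illustration of the kind of algorithm this section is about, here is a minimal sketch of iterative page ranking on a tiny in-memory graph. It assumes the common damping-factor formulation with a default of 0.85 and spreads the rank of pages with no outgoing links evenly; the course's exact formulation may differ, and the function name `pagerank` and its parameters are chosen here for illustration.

```python
def pagerank(links, damping=0.85, iterations=20):
    """Iteratively estimate ranks for a graph given as {page: [targets]}."""
    # Collect every page that appears as a source or as a target.
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_ranks = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * ranks[node] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # Assumption: a page with no links spreads its rank evenly.
                share = damping * ranks[node] / n
                for target in nodes:
                    new_ranks[target] += share
        ranks = new_ranks
    return ranks


if __name__ == "__main__":
    graph = {"a": ["c"], "b": ["c"], "c": ["a"]}
    for page, rank in sorted(pagerank(graph).items()):
        print(page, round(rank, 3))
```

Each round only needs a page's current rank and its outgoing links, so the per-page work maps naturally onto the parallel models covered earlier in the course.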

The workload is pretty low: just 2 assignments, each with about 1.5 months until the deadline. They're really easy and take about 30 minutes each to complete. However, they are VERY heavy at 25% each, so it's best to wait a while before submitting your work, in case you catch any careless mistakes.

Am I right in what I've babbled about? Probably not, as I don't really remember much from the course, which is probably why I only got an A-. Or maybe the bell curve was actually that high? Who knows.