CS3245

Course Title

Information Retrieval

Grade

A

Semester

AY23/24 S2

Review

The course is about designing indexes that help people find the information they need, similar to what the Google search engine does. We start off by building the simplest bigram models, which count the frequency of 2-character sequences in a document. These counts help to identify the document, enabling us to search for it later (I want a document containing 'Bi' and 'is').
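
To make the bigram idea concrete, here is a minimal sketch of counting 2-character sequences in Python. The function name char_bigram_counts and the sample sentence are made up for illustration; this is not the assignment's actual code.

```python
from collections import Counter


def char_bigram_counts(text):
    """Count every overlapping 2-character sequence in the text."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


# Hypothetical sample document, chosen to match the 'Bi' and 'is' example.
counts = char_bigram_counts("Bishan is in Singapore")
print(counts["Bi"])  # 1: only at the start of "Bishan"
print(counts["is"])  # 2: once inside "Bishan", once as the word "is"
```

A document whose counts for both 'Bi' and 'is' are non-zero would match the example query above.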

We then move on to more complex models like Boolean retrieval and postings lists. Boolean retrieval is a method of information retrieval that uses Boolean logic (AND, OR, NOT) to match documents containing specified keywords (I want a document containing 'Bishan' and 'Resident'). Postings lists, on the other hand, make Boolean retrieval faster: each term maps to a sorted list of the documents that contain it, and these lists are merged to answer a query. The order of merging affects how much work is done (e.g. evaluating ('A' AND 'B') AND 'C' can cost more or less than 'A' AND ('B' AND 'C'), even though both return the same documents).
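
As a rough sketch of why merge order matters, the snippet below intersects sorted postings lists and processes the shortest lists first, so intermediate results stay small. The names intersect, and_query and toy_index are hypothetical, and the index contents are invented for the example.

```python
def intersect(p1, p2):
    """Merge two sorted postings lists, keeping doc IDs present in both."""
    result = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result


def and_query(terms, index):
    """AND all terms together, starting from the shortest postings lists."""
    postings = sorted((index.get(term, []) for term in terms), key=len)
    if not postings:
        return []
    result = postings[0]
    for plist in postings[1:]:
        if not result:
            break  # nothing left to match, so skip the remaining merges
        result = intersect(result, plist)
    return result


# Hypothetical toy index: term -> sorted list of document IDs.
toy_index = {
    "bishan": [1, 4, 7, 9],
    "resident": [2, 4, 9, 12, 15],
    "singapore": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
}
print(and_query(["singapore", "bishan", "resident"], toy_index))  # [4, 9]
```

Here 'bishan' AND 'resident' is evaluated first, so the long 'singapore' list is only compared against two candidates.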

After that, we have tolerant retrieval, which allows us to search for close but not exact matches, using techniques like prefix searches and wildcard queries (I want a document containing a word that matches *mon).
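
One standard way to answer a wildcard query like *mon is a permuterm index, sketched below. Whether the course uses exactly this technique is an assumption on my part; the function names and the vocabulary are invented, and the linear scan stands in for the sorted or tree-based prefix lookup a real index would use.

```python
def build_permuterm(vocabulary):
    """Map every rotation of term + '$' back to the original term."""
    permuterm = {}
    for term in vocabulary:
        marked = term + "$"  # '$' marks the end of the term
        for i in range(len(marked)):
            permuterm[marked[i:] + marked[:i]] = term
    return permuterm


def wildcard_lookup(query, permuterm):
    """Resolve a query containing exactly one '*' via a prefix match."""
    head, tail = query.split("*")
    prefix = tail + "$" + head  # rotate the query so '*' falls at the end
    # Linear scan for simplicity; a real index would do a prefix search
    # over sorted rotations instead of checking every key.
    return sorted({term for rotation, term in permuterm.items()
                   if rotation.startswith(prefix)})


# Hypothetical vocabulary for illustration.
vocab = ["lemon", "salmon", "monday", "money", "sermon"]
index = build_permuterm(vocab)
print(wildcard_lookup("*mon", index))  # ['lemon', 'salmon', 'sermon']
print(wildcard_lookup("mon*", index))  # ['monday', 'money']
```

The matched terms would then be looked up in the normal postings lists to get the documents.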

The next topic is index construction and compression: how to store the index so that it is both space-efficient and time-efficient. We learn how to lay out the index so that it supports fast retrieval as well as fast updates. This is a tricky balance, as a more space-efficient index generally means slower retrieval, and vice versa.
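
To give a feel for the space versus time trade-off, here is a sketch of one common compression scheme: store the gaps between document IDs and write each gap with variable-byte encoding. The review does not say which scheme the course uses, so treat the choice of technique and all function names as assumptions.

```python
def vb_encode_number(n):
    """Encode one non-negative integer using 7 payload bits per byte."""
    chunks = []
    while True:
        chunks.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    chunks[-1] += 128  # set the high bit on the last byte of the number
    return bytes(chunks)


def vb_encode(doc_ids):
    """Gap-encode a sorted postings list, then variable-byte encode it."""
    out = bytearray()
    previous = 0
    for doc_id in doc_ids:
        out += vb_encode_number(doc_id - previous)
        previous = doc_id
    return bytes(out)


def vb_decode(data):
    """Invert vb_encode: rebuild the original sorted doc IDs."""
    doc_ids = []
    n = 0
    previous = 0
    for byte in data:
        if byte < 128:
            n = 128 * n + byte
        else:
            n = 128 * n + (byte - 128)
            previous += n  # undo the gap encoding
            doc_ids.append(previous)
            n = 0
    return doc_ids


postings = [824, 829, 215406]
encoded = vb_encode(postings)
print(len(encoded))        # 6 bytes for three doc IDs
print(vb_decode(encoded))  # [824, 829, 215406]
```

The list takes fewer bytes on disk, but every read now pays for a decoding pass, which is the kind of balance described above.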

Then comes the Vector Space Model, which makes up the bulk of the course. It represents both documents and queries as vectors, and then ranks documents by how similar their vectors are to the query vector. If a document vector is similar enough to the query vector, the document is returned as a result. The later topics all build on this model, for example by improving it with relevance feedback and query expansion.
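
A minimal sketch of the idea, assuming log-scaled tf-idf weights and cosine similarity (a common choice, though the exact weighting used in the course is not stated here). The function names and the three toy documents are invented for illustration.

```python
import math
from collections import Counter


def tf_idf_vector(tokens, doc_freq, n_docs):
    """Build a sparse tf-idf vector (term -> weight) for a list of tokens."""
    counts = Counter(tokens)
    return {
        term: (1 + math.log10(tf)) * math.log10(n_docs / doc_freq[term])
        for term, tf in counts.items()
        if term in doc_freq  # ignore terms that appear in no document
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return doc IDs ordered from most to least similar to the query."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    doc_freq = Counter(t for tokens in tokenized.values() for t in set(tokens))
    n_docs = len(docs)
    q_vec = tf_idf_vector(query.lower().split(), doc_freq, n_docs)
    scores = {
        doc_id: cosine(q_vec, tf_idf_vector(tokens, doc_freq, n_docs))
        for doc_id, tokens in tokenized.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy collection.
docs = {
    "d1": "bishan resident association meeting",
    "d2": "resident parking permit",
    "d3": "weather forecast for singapore",
}
print(rank("bishan resident", docs))  # ['d1', 'd2', 'd3']
```

Unlike Boolean retrieval, every document gets a score, so results can be ordered rather than just matched or not matched.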

In terms of workload, I'd say it's pretty high. There are 4 homework assignments in total, and they are pretty long. Performance is extremely important here, as you are graded mainly on how accurately your search engine returns the correct documents. There's a lot of tweaking involved, and you have to be pretty comfortable working with Python and file descriptors. The finals were quite manageable. Just be sure to know your concepts well, and you should be fine.