# SIT772 Database and Information Assignment 2

## Introduction

• This assessment is for students to develop an understanding of information retrieval techniques.
• This is an individual assessment task.
• The project documentation submitted should include the answers, working, and associated tables and graphs for the tasks.
• This assignment has a total of 20 marks and is worth 20% of your final result.

### Unit Learning Outcomes

• Of the three Unit Learning Outcomes (ULOs) of this unit SIT772, this assessment task will focus on the last two ULOs. These are:
  o ULO 3 - At the end of this unit students will be able to design and develop relational databases by using SQL and a database management system.
  o ULO 5 - Develop problem solving skills in the context of data processing systems.
  o ULO 6 - Work independently on self-directed learning tasks.

• The assessment of this task will indicate whether students have partially attained these unit learning outcomes.

## Instructions

• Read these instructions and the following 2 questions.
• Answer as many questions as possible.
• Place your name, ID and answers in your document.
• Please answer all questions in a single document and submit this to the assignment folder.

### Task 1: Zipf’s Law (5+5=10 Marks)

1. Provide a brief description of Zipf’s Law and how it relates to information retrieval (searching for terms/words in a corpus).
2. Assume Zipf’s Law holds and that the most frequent term accounts for 20% of word occurrences. What is the fewest number of most common words that together account for at least 60% of word occurrences (i.e. the minimum value of m such that at least 60% of word occurrences are one of the m most common words)? You can use a table to help present your result.
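The cumulative table suggested in part 2 can be built mechanically. The following is a minimal sketch, assuming the common form of Zipf’s Law in which the word at rank r accounts for a share of occurrences equal to the top word's share divided by r. The function name `zipf_cumulative_table` and the numbers in the usage lines (a 10% top share and a 25% target) are illustrative assumptions, deliberately different from the values in the task.

```python
def zipf_cumulative_table(top_share, target):
    """Build (rank, share, cumulative share) rows until the target is reached.

    Assumption: the word at rank r has share top_share / r (Zipf's law with
    exponent 1). Rows stop at the first rank whose cumulative share is
    at least `target`.
    """
    if top_share <= 0 or target <= 0:
        raise ValueError("top_share and target must be positive")
    rows = []
    cumulative = 0.0
    rank = 0
    while cumulative < target:
        rank += 1
        share = top_share / rank
        cumulative += share
        rows.append((rank, share, cumulative))
    return rows


# Illustrative values only (not the values from the task).
rows = zipf_cumulative_table(top_share=0.10, target=0.25)
for rank, share, cumulative in rows:
    print(f"{rank:>4}  {share:8.4f}  {cumulative:8.4f}")
print("minimum m =", rows[-1][0])
```

The last row gives the minimum m; substituting the task's own percentages for the two arguments produces the table for part 2.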

### Task 2: Information Retrieval (IR) Evaluation (3+3+4=10 Marks)

The following data displays retrieval results for two different algorithms (Algorithm 1 and Algorithm 2) in response to two distinct queries (Query 1 and Query 2). An expert has manually labelled each of the documents as being either relevant or not relevant to the queries.

Algorithm 1 returns the following results:

 Query 1: d33 d6 d9 d48 d56 d76 d10 d29 d30 d5 d11 d66 d3
 Query 2: d10 d76 d5 d67 d13 d45 d91 d16 d17 d22 d20 d71 d48 d60 d25 d27

Algorithm 2 returns the following results:

 Query 1: d44 d41 d7 d77 d13 d14 d90 d80 d70 d4 d8 d29 d6 d5 d15 d17 d20 d65 d2 d33
 Query 2: d9 d91 d99 d30 d17 d13 d26 d93 d42 d79 d12 d10 d41 d11 d85 d89 d1 d49 d52 d76 d20 d43 d88 d7 d98 d51 d50 d6 d3 d87 d2 d28 d15 d14

An expert has identified the following documents as being relevant to Query 1 and Query 2, respectively.

 Relevant to Query 1: d8 d13 d29 d33 d41
 Relevant to Query 2: d2 d3 d7 d8 d9 d11 d12 d13 d15 d16 d20

Objectives:

1. For Algorithm 1, plot the precision versus recall curves for Query 1 and Query 2, interpolated to the 11 standard recall levels. Also plot the average precision versus recall curve for Algorithm 1, averaged over the two queries (all three curves should be on a single chart).
2. For Algorithm 2, plot the precision versus recall curves for Query 1 and Query 2, interpolated to the 11 standard recall levels. Also plot the average precision versus recall curve for Algorithm 2, averaged over the two queries (all three curves should be on a single chart, but a separate chart from that used in part 1).
3. Plot the averages for Algorithm 1 and Algorithm 2 on a separate chart, and compare the algorithms in terms of precision and recall. Do you think one of the algorithms is superior? Provide a brief explanation of why this is the case.
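The interpolation step in these objectives can be sketched as follows. This is a minimal sketch under two stated assumptions: interpolated precision at a recall level is taken as the highest precision observed at any recall greater than or equal to that level, and relevant documents that are never retrieved still count in the recall denominator. The function names and the toy document identifiers are hypothetical and are not the assignment data.

```python
def interpolated_precision(ranking, relevant):
    """Interpolated precision at the 11 standard recall levels (0.0 to 1.0).

    Assumption: interpolated precision at a level is the maximum precision at
    any recall >= that level, and 0.0 if that recall is never reached.
    """
    relevant = set(relevant)
    total = len(relevant)
    hits_at_rank = []  # (relevant found so far, rank) at each relevant hit
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            hits_at_rank.append((hits, rank))
    curve = []
    for i in range(11):  # recall level i / 10
        # recall >= i/10 is tested in integers: hits/total >= i/10
        candidates = [h / r for h, r in hits_at_rank if h * 10 >= i * total]
        curve.append(max(candidates, default=0.0))
    return curve


def average_curves(curves):
    """Element-wise mean of several 11-point curves."""
    return [sum(values) / len(values) for values in zip(*curves)]


# Toy data only: hypothetical rankings and relevance judgements.
curve_a = interpolated_precision(["a", "b", "c", "d", "e"], {"a", "c", "x"})
curve_b = interpolated_precision(["p", "q"], {"q"})
average = average_curves([curve_a, curve_b])
for i, (pa, pb, avg) in enumerate(zip(curve_a, curve_b, average)):
    print(f"{i / 10:.1f}  {pa:.3f}  {pb:.3f}  {avg:.3f}")
```

Each returned list holds one value per recall level from 0.0 to 1.0, so it can be plotted directly against those levels in a spreadsheet or plotting library of your choice.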