SIT772 database and information

Assignment 2

Introduction

  • This assessment is for students to develop an understanding of information retrieval techniques.
  • This is an individual assessment task.
  • The submitted project documentation should include the answers, working, and associated tables and graphs for the tasks.
  • This assignment has a total of 20 marks and is worth 20% of your final result.

Unit Learning Outcomes

  • Of the three Unit Learning Outcomes (ULOs) of this unit SIT772, this assessment task will focus on the last two ULOs. These are:
      o ULO 3 - At the end of this unit students will be able to design and develop relational databases by using SQL and a database management system.
      o ULO 5 - Develop problem solving skills in the context of data processing systems.
      o ULO 6 - Work independently on self-directed learning tasks.

  • The assessment of this task will indicate whether students can partially attain these unit learning outcomes.

Instructions

  • Read these instructions and the following 2 questions.
  • Answer as many questions as possible.
  • Place your name, ID and answers in your document.
  • Please answer all questions in a single document and submit this to the assignment folder.

Task 1: Zipf’s Law (5+5=10 Marks)

  1. Provide a brief description of Zipf’s Law and how it relates to information retrieval (searching for terms/words in a corpus).
  2. Assume Zipf’s law holds and that the most frequent term accounts for 20% of word occurrences. What is the fewest number of most common words that together account for more than 60% of word occurrences (i.e. the minimum value of m such that at least 60% of word occurrences are one of the m most common words)? You can use a table to help present your result.
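
As a rough illustration of the kind of calculation part 2 calls for, the sketch below assumes the simplest form of Zipf's law, in which the word of rank r accounts for a share of occurrences equal to the top word's share divided by r. The function names and the example figures (a 10% top share and a 25% target) are hypothetical and deliberately differ from the numbers in the task.

```python
def cumulative_shares(top_share, m):
    """Return (rank, share, cumulative share) rows for the m most common words.

    Assumes the simplest Zipf model: share(rank) = top_share / rank.
    """
    rows = []
    total = 0.0
    for rank in range(1, m + 1):
        share = top_share / rank
        total += share
        rows.append((rank, share, total))
    return rows


def min_words_for_coverage(top_share, target):
    """Smallest m whose m most common words cover at least `target` of occurrences."""
    if top_share <= 0:
        raise ValueError("top_share must be positive")
    total = 0.0
    rank = 0
    while total < target:
        rank += 1
        total += top_share / rank
    return rank


# Hypothetical figures, not the ones from the task.
m = min_words_for_coverage(top_share=0.10, target=0.25)
for rank, share, total in cumulative_shares(0.10, m):
    print(f"{rank:>4}  {share:.4f}  {total:.4f}")
print("minimum m:", m)  # minimum m: 7
```

The printed rows have the same shape as the table the task suggests: one row per rank, with the individual share and the running total.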

Task 2: Information Retrieval (IR) Evaluation (3+3+4=10 Marks)

The following data displays retrieval results for two different algorithms (Algorithm 1 and Algorithm 2) in response to two distinct queries (Query 1 and Query 2). An expert has manually labelled each of the documents as being either relevant or not relevant to the queries.

Algorithm 1 returns the following results:

Query 1: d33, d6, d9, d48, d56, d76, d10, d29, d30, d5, d11, d66, d3

Query 2: d10, d76, d5, d67, d13, d45, d91, d16, d17, d22, d20, d71, d48, d60, d25, d27

Algorithm 2 returns the following results:

Query 1: d44, d41, d7, d77, d13, d14, d90, d80, d70, d4, d8, d29, d6, d5, d15, d17, d20, d65, d2, d33

Query 2: d9, d91, d99, d30, d17, d13, d26, d93, d42, d79, d12, d10, d41, d11, d85, d89, d1, d49, d52, d76, d20, d43, d88, d7, d98, d51, d50, d6, d3, d87, d2, d28, d15, d14

An expert has identified the following documents as being relevant to Query 1 and Query 2, respectively.

Relevant to Query 1: d8, d13, d29, d33, d41

Relevant to Query 2: d2, d3, d7, d8, d9, d11, d12, d13, d15, d16, d20

Objectives:

  1. For Algorithm 1, plot the precision versus recall curves for Query 1 and Query 2, interpolated to the 11 standard recall levels. Also plot the average precision versus recall curve for Algorithm 1 (all three curves should be on a single chart).
  2. For Algorithm 2, plot the precision versus recall curves for Query 1 and Query 2, interpolated to the 11 standard recall levels. Also plot the average precision versus recall curve for Algorithm 2 (all three curves should be on a single chart, but a separate chart from that used in part 1).
  3. Plot the average curves for Algorithm 1 and Algorithm 2 together on a separate chart, and compare the algorithms in terms of precision and recall. Do you think one of the algorithms is superior? Provide a brief explanation of why this is the case.
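
The objectives above rely on interpolating precision to the 11 standard recall levels (0.0, 0.1, ..., 1.0). A minimal sketch of one common convention follows: the interpolated precision at a recall level is taken as the highest precision observed at any recall greater than or equal to that level, and 0 if that recall is never reached. The function names and the toy ranking are hypothetical and are not taken from the data in this task.

```python
def interpolated_precision(ranking, relevant, levels=11):
    """Interpolated precision at `levels` evenly spaced recall levels.

    ranking  -- document ids in the order the algorithm returned them
    relevant -- ids judged relevant (including any that were never retrieved)
    """
    relevant = set(relevant)
    total = len(relevant)
    hits = 0
    observed = []  # (relevant found so far, precision) at each relevant hit
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            observed.append((hits, hits / rank))
    curve = []
    for i in range(levels):
        # Recall hits/total >= i/(levels-1), compared in integers to avoid rounding.
        reachable = [p for h, p in observed if h * (levels - 1) >= i * total]
        curve.append(max(reachable, default=0.0))
    return curve


def average_curves(curves):
    """Average several interpolated curves level by level."""
    return [sum(values) / len(values) for values in zip(*curves)]


# Toy data (hypothetical): two queries answered by one algorithm.
q1 = interpolated_precision(["a", "b", "c", "d", "e"], {"a", "c"})
q2 = interpolated_precision(["x", "y", "z"], {"y", "w"})
avg = average_curves([q1, q2])
for level, p1, p2, pa in zip(range(11), q1, q2, avg):
    print(f"recall {level / 10:.1f}: {p1:.3f} {p2:.3f} {pa:.3f}")
```

Each returned list has one precision value per recall level, so the three lists can be passed directly to a plotting tool (a spreadsheet chart or a plotting library, for example) with the recall levels on the x-axis. In the second toy query one relevant document is never retrieved, so precision drops to 0 beyond recall 0.5.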