Data Analysis Coursework Assignment

Weighting of assessment: 100% of total marks

Word limit: 6000 words

Aim(s)

The aim of this module is to help students acquire the skills needed for roles such as Data Scientist, Data Modeller and Data Analyst, and to enable them to understand and implement a range of statistical and computational techniques for analysing datasets using industry-standard software and programming languages.

Learning Outcomes

After completing this module the student should be able to:

  • Critically analyse and evaluate various statistical and computational techniques for analysing datasets and determine the most appropriate technique for a business problem;
  • Critically evaluate, develop and implement solutions for processing datasets and solving complex problems in various environments using relevant programming paradigms;
  • Appraise and apply key steps and issues involved in data preparation, cleaning, exploring, creating, optimizing and evaluating models;
  • Contrast and apply aspects of data science applications and their use.

Supermarket Sales Data Challenge

Overview

This dataset contains supermarket transactions over a period of two years, covering four item categories (Type 1 to Type 4). The supermarket has a number of branches across two main provinces of the country.

As a data scientist, your task is to clean, normalise and transform these data into R-compatible formats and to undertake extensive data mining using Machine Learning. The main objective of this data challenge is to develop Machine Learning models that uncover transaction patterns and forecast sales, using the following four (4) data sets, which contain two years of transaction details. Report on any interesting patterns, such as buying patterns and market-basket associations, that your analysis reveals, with supporting visualisations where possible. In your discussion, provide a critical synopsis of the challenges of data analysis, integration and visualisation that you faced during this exercise. State any relevant assumptions you made, with valid justifications.
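As an illustration of what a market-basket analysis measures, the following is a minimal sketch in base R that counts how often pairs of items share a basket. It assumes that Basket identifies a single shopping trip and that Code identifies an item; the brief does not define either field, so both readings are assumptions, and the toy values are invented.

```r
# Toy stand-in for Sales.csv: invented values, assumed column meanings.
sales <- data.frame(
  Basket = c(1, 1, 1, 2, 2, 3, 3, 3),
  Code   = c("A", "B", "C", "A", "B", "A", "C", "D"),
  stringsAsFactors = FALSE
)

# One vector of distinct item codes per basket.
baskets <- lapply(split(sales$Code, sales$Basket), unique)

# List every unordered pair of items found in one basket.
pairs_in <- function(items) {
  items <- sort(items)
  if (length(items) < 2) return(character(0))
  apply(combn(items, 2), 2, paste, collapse = " + ")
}

# Number of baskets in which each pair occurs.
pair_counts <- table(unlist(lapply(baskets, pairs_in)))

# Support = share of all baskets that contain the pair.
support <- sort(pair_counts / length(baskets), decreasing = TRUE)
print(round(support, 2))
```

Pair counting of this kind becomes slow on two years of transactions; a dedicated association-rule package may be a better fit for the full dataset, and that choice is left to the student.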

Datasets

Four (4) data sets have been provided: Item, Sales, Promotion and Supermarkets.

Item.csv

This dataset contains information about the items for sale, with the following fields.

  • Code
  • Description
  • Type
  • Brand
  • Size

Sales.csv

This dataset contains two years of sales transactions, with the following fields.

  • Code
  • Amount
  • Units
  • Time of transactions
  • Province
  • CustomerID
  • Supermarket No
  • Basket
  • Day
  • Voucher

Promotion.csv

This dataset contains the sales promotions run on various items in different supermarkets, with the following fields.

  • Code
  • Supermarket No
  • Week
  • Feature
  • Display
  • Province

Supermarkets.csv

This dataset contains supermarket store location details, with the following fields.

  • Supermarket No
  • Post-code

Please note that NO other information is provided on the data definitions or the meaning of the fields. You will have to explore the data to identify what each field means and how the datasets relate to one another.
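One way to start exploring how the datasets relate is to look for shared column names and check whether a candidate key is unique. The sketch below uses small inline stand-ins for Item.csv and Sales.csv so that it runs on its own; the header spellings and all values are assumptions, and with the real files the `text = ...` argument would be replaced by the file name, for example `read.csv("Item.csv")`.

```r
# Toy stand-ins for two of the files (invented rows, assumed headers).
item <- read.csv(text = "Code,Description,Type,Brand,Size
101,Pasta A,Type 1,BrandX,500g
102,Sauce B,Type 2,BrandY,250g
103,Syrup C,Type 3,BrandZ,1l", stringsAsFactors = FALSE)

sales <- read.csv(text = "Code,Amount,Units,Basket
101,1.50,1,1
102,2.25,1,1
104,0.99,2,2", stringsAsFactors = FALSE)

# Columns that appear in both files are candidate join keys.
shared_cols <- intersect(names(item), names(sales))

# A key column should not repeat within the lookup table.
code_is_unique <- !anyDuplicated(item$Code)

# Codes sold but never described in Item point to a data-quality issue.
orphan_codes <- setdiff(sales$Code, item$Code)

str(item)              # column types as R parsed them
print(shared_cols)
print(code_is_unique)
print(orphan_codes)
```

The same three checks can be repeated for each pair of files (for instance Sales and Supermarkets on Supermarket No) to build up the relationship description asked for in the first task.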

Assignment tasks and marking criteria

Task | Description | Marks
Data description | Provide a detailed description of each dataset, its properties and its relationships to the other datasets | 5%
Collecting data | Read the data from the CSV files into the R environment for processing | 5%
Data cleaning, exploring and preparing the data | Clean any outliers and exceptional values from the datasets; normalisation and scaling; merge the datasets; create training and test datasets | 35%
Apply Machine Learning and model building | Train a model on the data; apply different Machine Learning approaches and discuss them | 20%
Evaluating model performance | Accuracy of each of the different models | 10%
Improving model performance | Alternative ways of normalisation and model building, and their performance | 10%
Comparative analysis | Patterns identified and their visualisations; a detailed comparative analysis of the scaling and Machine Learning approaches (strengths, limitations, uniqueness), in relation to integration, transformation, visualisation and data mining | 10%
Discussion | Provide a brief discussion of the knowledge gained | 5%
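The preparation, modelling and evaluation tasks listed above can be sketched end to end in base R as follows. This is only an illustration under stated assumptions: the data are randomly generated stand-ins, the 1.5 × IQR rule, min-max scaling, the 80/20 split, the linear model and RMSE are example choices rather than required methods, and the column meanings are guesses.

```r
set.seed(42)  # reproducible toy data and split

# Toy stand-ins for Sales.csv and Item.csv (invented values).
sales <- data.frame(
  Code  = sample(1:4, 200, replace = TRUE),
  Units = sample(1:5, 200, replace = TRUE)
)
sales$Amount <- 2 * sales$Units + rnorm(200, sd = 0.5)
sales$Amount[1:3] <- c(250, 300, -80)  # planted exceptional values
item <- data.frame(Code = 1:4, Type = paste("Type", 1:4),
                   stringsAsFactors = FALSE)

# 1. Cleaning: drop rows whose Amount lies outside the 1.5 * IQR fences.
q <- quantile(sales$Amount, c(0.25, 0.75))
fence <- 1.5 * (q[2] - q[1])
keep <- sales$Amount >= q[1] - fence & sales$Amount <= q[2] + fence
clean <- sales[keep, ]

# 2. Merging: attach item attributes to each transaction.
merged <- merge(clean, item, by = "Code")

# 3. Splitting: 80% of rows for training, the rest for testing.
train_idx <- sample(nrow(merged), size = floor(0.8 * nrow(merged)))
train <- merged[train_idx, ]
test  <- merged[-train_idx, ]

# 4. Scaling: min-max, with the range taken from the training set only
#    so that no information leaks from the test set.
min_max <- function(x, lo, hi) (x - lo) / (hi - lo)
lo <- min(train$Units)
hi <- max(train$Units)
train$UnitsScaled <- min_max(train$Units, lo, hi)
test$UnitsScaled  <- min_max(test$Units, lo, hi)

# 5. Modelling and evaluation: baseline linear model, RMSE on test data.
fit  <- lm(Amount ~ UnitsScaled + Type, data = train)
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$Amount - pred)^2))

cat("Rows removed as outliers:", nrow(sales) - nrow(clean), "\n")
cat("Test RMSE:", round(rmse, 3), "\n")
```

A linear model is unaffected by rescaling its predictors, so the scaling step here only demonstrates the mechanics; it matters more for distance-based or gradient-based methods, which is one point a comparative analysis could examine.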

What to submit

Submit a detailed report covering each of the above tasks, including the relevant R statements with appropriate comments. Explain each R statement in detail before showing it. Include visualisations where they are needed for storytelling. Attach a CD containing ALL your workings, the datasets and the merged datasets, if any.

Referencing Requirements

All referencing should utilize the Harvard Style.

REPORT STRUCTURE

Paper Size: A4
Word Count: 6000 words
Printing Margins (LHS, RHS): 1 inch
Binding Margin: ½ inch
Header and Footer: 1 inch
Printing: Single-sided
Basic Font Size: 12
Font Style: Arial/Times New Roman
Presentation: Bound document