CSCI312 Big Data Management
School of Computing & Information Technology University of Wollongong CSCI312 Big Data Management Singapore 2020-2 Assignment 2
The objectives of Assignment 2 include conceptual modelling of a data warehouse, logical design of a data warehouse and implementation of a data warehouse as a collection of internal tables and external tables in Hive, querying a data cube, implementation of HBase table and querying and manipulating data in HBase table,
This assignment is worth 30% of the total evaluation in the subject.
Only electronic submission through Moodle at:
will be accepted. All email submissions will be deleted and mark 0 ("zero") will be immediately granted for Assignment 2. A submission procedure is explained at the end of Assignment 2 specification.
A policy regarding late submissions is included in the subject outline.
Only one submission of Assignment 2 is allowed and only one submission per student is accepted.
A submission marked by Moodle as "late" is always treated as a late submission no matter how many seconds it is late.
A submission that contains an incorrect file attached is treated as a correct submission with all consequences coming from the evaluation of the file attached.
All files left on Moodle in a state "Draft(not submitted)" will not be evaluated.
A submission of compressed files (zipped, gzipped, rared, tared, 7-zipped, lhzed, … etc) is not allowed. The compressed files will not be evaluated.
The second assignment is an individual assignment and it is expected that all its tasks will be solved individually without any cooperation with the other students. However, it is allowed to declare in the submission comments that a particular component or task of this assignment has been implemented in cooperation with another student. In such a case evaluation of a task or component may be shared with another student. In all other cases plagiarism will result in a FAIL grade being recorded for entire assignment. If you have any doubts, questions, etc. please consult your lecturer or tutor during laboratory/tutorial classes or over e-mail.
Task 1 (8
Conceptual modelling of a data warehouse
An objective of this task is to create a conceptual schema of a sample data warehouse domain described below. Read and analyse the following specification of a data warehouse domain.
A group of hospitals would like to create a data warehouse to store information about the medical examinations performed on the patients and later on to analyse the contents of a data warehouse. It is expected that the planned data warehouse will contain historical information collected over a long period of time.
A data warehouse supposed to contain information about the medical examinations, patients, medical staff conducting medical examinations and including nurses and doctors, hospitals, cities and countries the patients come from and hospitals are located at, and locations of medical examinations.
The patients are described by a medical insurance number, first name, last name, data of birth and address. A medical insurance number identifies each patient. An address consists of building number, street name, city, and country.
The medical examinations are performed in the hospitals. A hospital is identified by its unique name, and described by city and country it is located at. A medical examination is performed in a medical examination room. A room number, building level, and building number describe a medical examination room.
A medical staff member is described be an employee number, first name, last name, and data of birth. A medical staff member belongs to one of the hospitals. A medical staff member is either a nurse or a doctor. Nurses are additionally described by qualifications and doctors are described by a specialization.
A medical examination may involve an external medical expert whose role is to provide an independent opinion. An external medical expert is described by a title, first name, las name and a name of organization he/she belongs to.
A medical examination is described by date and time when performed and the measurement of the parameters like weight, blood pressure, cholesterol levels, and the others (please insert at least two other measurements of your choice).
A data warehouse must be designed such it would be possible to easily implement the following classes of applications.
- Find the total number of medical examinations performed per year, per month and per day, per patient, per nurse, per doctor, per external expert who participated in
the examinations, per hospital, per city, per country, etc. For example, it should be to find the total number of medical examinations performed in each month of 2017 in each hospital.
- Find the aggregations of the measures obtained from the medical examinations per year, per month and per day, per patient, per nurse, per doctor, per external expert who participated in the examinations, per hospital, per city, per country, etc. For example, it should be possible to find an average blood pressure of all patients older than 70 per last 12 months.
To create a conceptual schema of a sample data warehouse domain follow the steps listed below.
Step 1 Find a fact entity, find the measures describing a fact entity.
Step 2 Find the dimensions.
Step 3 Find the hierarchies over the dimensions.
Step 4 Find the descriptions (attributes) of all entity types.
Step 5 Draw a conceptual schema.
To draw a conceptual schema, use a graphical notation explained to you in a presentation 11 Conceptual Data Warehouse Design. To draw your diagram, you must use UMLet diagram drawing tool and apply a "Conceptual modelling" notation, Selection of a drawing notation is available in the right upper corner of the main menu of UMLet diagram drawing tool.
A file solution1.pdf with a drawing of a conceptual schema.
Task 2 (4 Logical modelling of a data warehouse
Consider the following conceptual schema of a data warehouse.
Perform a step of logical design to transform a conceptual schema given above into a logical schema (star schema). Use UMLet diagram drawing tool and apply a "Logical modelling" notation to draw a logical schema. Selection of a drawing notation is available in the right upper corner of the main menu of UMLet. Save a diagram of logical schema in a file solution2.uxf and export it to a file solution2.pdf.
A file solution2.pdf with a drawing of a logical schema.
Task 3 (6 Implementation of a data warehouse as a collection of external tables in Hive
Consider the following two-dimensional data cube.
The data cube contains information about that parts that can be shipped by the suppliers.
Download and unzip a file task3.zip. You should obtain a folder task3 with the following files: part.tbl, supplier.tbl, partsupp.tbl.
Use an editor to examine the contents of *.tbl files. Note, that the contents of the files can be loaded into the relational tables obtained from the transformation of the twodimensional data cube given above into the relational table PART, SUPPLIER, and PARTSUPP.
Transfer the files into HDFS.
Implement HQL script solution3.hql that creates the external tables obtained from a step of logical design performed earlier. The external tables must overlap on the files transferred to HDFS in the previous step. Note, that a header in each *.tbl file must be removed before creating the external tables.
Include into solution3.hql script SELECT statements that any 5 rows from each one of the external tables implemented in the previous step and the total number of rows included in each table.
When ready, use a command line interface beeline to process a script solution3.hql and to save a report from processing in a file solution3.rpt.
A file solution3.rpt with a report from processing of HQL script solution3.hql.
Task 4 (4
Querying data cube
Download and unzip a file task4.zip. You should obtain a folder task4 with the following files: dbcreate.hql, dbdrop.hql, partsupp.tbl, lineitem.tbl, and orders.tbl.
A file orders.tbl contains information about the orders submitted by the customers. A file lineitem.tbl contains information about the items included in the orders. A file partsupp.tbl contains information about the items and suppliers of items included in the orders.
Open Terminal window and use cd command to navigate to a folder with the just unzipped files. Start Hive Server 2 in the terminal window (remember to start Hadoop first). When ready process a script file dbcreate.hql to create the internal relational tables and to load data into the tables. You can use either beeline or SQL Developer. A script dbdrop.hql can be used to drop the tables.
The relational tables PARTSUPP, LINEITEM, ORDERS implement a simple twodimensional data cube. The relational tables PARTSUPP and ORDERS implement the dimensions of parts supplied by suppliers and orders. A relational table LINEITEM implements a fact entity of a data cube.
Implement the following queries. In each case, try to find the most efficient implementation.
- For the order clerks (O_CLERK) Clerk#000000522, Clerk#000000154, find the total number of ordered parts per order date (O_ORDERDATE), per supplier (L_SUPPKEY), per order date and supplier (O_ORDEDATE, L_SUPPKEY), and the total number of ordered parts.
- For the parts with the keys (L_PARTKEY) 7, 8,9 find the largest discount applied (L_DISCOUNT) per part key (L_PARTKEY) and per part key and supplier key (L_PARTKEY, L_SUPPKEY) and the largest discount applied at all.
- Find the smallest quantity (L_QUANTITY) per order year (O_ORDERDATE), and order clerk (O_CLERK).
- Find the smallest tax (L_TAX) applied per supplier key (L_SUPPKEY).
- Find an average quantity (L_QUANTITY) and the largest quantity (L_QUANTITY) per part key (L_PARTKEY), and per part key, and order year (L_PARTKEY, O_ORDERDATE). Consider only parts with the keys 5, 6, 7, 8, and 9.
When ready, save your SELECT statements in a file solution4.hql. Then, process a script file solution4.hql and save the results in a report solution4.rpt.
A file solution4.rpt that contains a report from processing of SELECT statements.
Task 5 (4
Implementation of HBase table
Implement as a single HBase table a database that contains information described by the following conceptual schema.
(1) Create HBase script solution5.hb with HBase shell commands that create HBase table and load sample data into the table. Load into the table information about at least two parts and two suppliers and such that one supplier supplies one part and the other supplier supplies 2 parts.
When ready use HBase shell to process a script file solution5.hb and to save a report from processing in a file solution5.rpt.
A file solution5.rpt that contains a report from processing of solution5.hb script with the statements that create HBase table and load sample data.
Task 6 (4 marks)
Querying and manipulating data in HBase table
Consider a conceptual schema given below. The schema represents a simple database domain where students submit assignments and each submission consists of several files and it is related to one subject.
Download a file task6.hb with HBase shell commands and use HBase shell to process it. Processing of task6.hb creates HBase table task6 and loads some data into it.
Use HBase shell to implement the following queries and data manipulations on the HBase table created in the previous step. Save the queries and data manipulations in a file solution6.hb.
- Find all information about a student number 007, list one version per cell.
- Find all information about a submission of assignment 1 performed by a student 007 in a subject 312, list one version per cell.
- Find the first and the last names of all students, list one version per cell.
- Find all information about a student whose last name is Bond, list two versions per cell.
- Delete a column family FILES.
- List information about all submissions performed by the students.
- Increase the total number of versions in each cell of a column family SUBJECT.
- Delete HBase table task6.
When ready, start HBase shell and process a script file solution6.hb with Hbase command shell. When processing is completed copy the contents of Command window with a listing from processing of the script and paste the results into a file solution6.rpt. Save the file. When ready submit a file solution6.rpt.
A file solution6.rpt with a listing from processing of a script file solution6.hb.