Logical Design for Data Warehousing and the HBase Data Model
Topics include:
MapReduce, the MapReduce data processing model, Java MapReduce applications
Hadoop architecture, HDFS interfaces
Hive, Hive data structures
Data warehouse concepts, conceptual design, logical design
SQL for data warehousing
HBase data model, HBase operations
Apache Pig, Pig Latin
Spark, Spark data model, Spark operations, stream processing
You must specify the key-value data in the input and output of the Map and Reduce stages.
Insert at least 5 rows into the relational table created in the previous step. Two employees must participate in a few projects and know a few programming languages. One employee must participate in a few projects but not know any programming languages. One employee must know a few programming languages but not participate in any projects.
One employee must neither know any programming languages nor participate in any projects.
The data cube contains information about the parts that can be shipped by the suppliers. Download and unzip the file task4.zip. You should obtain a folder task4 with the following files: part.tbl, supplier.tbl, partsupp.tbl.
When ready, use the command-line interface beeline to process the script solution4.hql and save a report from the processing in a file solution4.rpt.
Deliverables
Open a Terminal window and use the cd command to navigate to the folder with the just-unzipped files. Start Hive Server 2 in the terminal window (remember to start Hadoop first). When ready, process the script file dbcreate.hql to create the internal relational tables and to load data into them. You can use either beeline or SQL Developer.
The script dbdrop.hql can be used to drop the tables.
(4) For each part, list its key (PS_PARTKEY), all its available quantities (PS_AVAILQTY) sorted in descending order, and a rank (position number in ascending order) of each quantity. Consider only the parts with the keys 10 and 20. Use an analytic function ROW_NUMBER().
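The intended result of ROW_NUMBER() here can be checked against a plain-Python simulation (illustrative only; the sample quantities below are invented, and the actual query should be written in HiveQL):

```python
# Simulation of ROW_NUMBER() OVER (PARTITION BY ps_partkey ORDER BY ps_availqty):
# within each part, rank quantities in ascending order, then list rows
# in descending order of quantity together with that ascending rank.
from itertools import groupby

rows = [  # (ps_partkey, ps_availqty) -- invented sample data, keys 10 and 20 only
    (10, 500), (10, 300), (10, 800),
    (20, 100), (20, 900),
]

result = []
for _, grp in groupby(sorted(rows), key=lambda r: r[0]):
    # rank = position number in ascending order of quantity within the part
    for rank, (part, qty) in enumerate(sorted(grp, key=lambda r: r[1]), start=1):
        result.append((part, qty, rank))

# present quantities in descending order, per part
result.sort(key=lambda r: (r[0], -r[1]))
print(result)  # -> [(10, 800, 3), (10, 500, 2), (10, 300, 1), (20, 900, 2), (20, 100, 1)]
```

Note that the rank column still reflects the ascending position even though the listing order is descending, which is exactly why two different orderings appear in the task.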
(5) For each part, list its key (PS_PARTKEY), its available quantity, and an average available quantity (PS_AVAILQTY) of the current quantity and all previous quantities in the ascending order of available quantities. Consider only the parts with the keys 15 and 25. Use an appropriate analytic function.
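The running average asked for in (5) corresponds to a window of the current row and all preceding rows (in HiveQL, an AVG() with a ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW frame would be a natural fit). A plain-Python check of the semantics, with invented sample data:

```python
# Running average over the current quantity and all previous quantities,
# taken in ascending order of quantity within each part.
rows = [(15, 400), (15, 200), (15, 600), (25, 100), (25, 300)]  # invented sample data

report = []
for part in sorted({p for p, _ in rows}):
    qtys = sorted(q for p, q in rows if p == part)
    for i, q in enumerate(qtys):
        running_avg = sum(qtys[: i + 1]) / (i + 1)
        report.append((part, q, running_avg))
print(report)  # -> [(15, 200, 200.0), (15, 400, 300.0), (15, 600, 400.0), (25, 100, 100.0), (25, 300, 200.0)]
```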
The application must have the following two parameters.
(1) A path to and a name of the file to be moved from.
(3) Upload to HDFS a small text file for the purpose of future testing. The name and location of the file in HDFS are up to you.
(4) Use Hadoop to run your application that moves a file on HDFS from one location to another.
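On a real cluster the move itself would typically go through Hadoop's Java FileSystem API (e.g. FileSystem.rename). The two-parameter structure of the application can be sketched in Python, with the local file system standing in for HDFS so the logic is runnable anywhere (all file names below are invented for testing):

```python
# Sketch of the two-parameter "move a file" application.
# The os-level rename stands in for an HDFS client call such as
# Hadoop's FileSystem.rename; only the parameter handling and the
# move logic are illustrated here.
import os
import tempfile

def move_file(src_path: str, dest_dir: str) -> str:
    """Move src_path into dest_dir and return the new path."""
    dest_path = os.path.join(dest_dir, os.path.basename(src_path))
    os.rename(src_path, dest_path)
    return dest_path

if __name__ == "__main__":
    # self-test with a temporary file instead of command-line arguments
    with tempfile.TemporaryDirectory() as d:
        src = os.path.join(d, "test.txt")
        with open(src, "w") as f:
            f.write("hello")
        target = os.path.join(d, "moved")
        os.mkdir(target)
        new_path = move_file(src, target)
        print(os.path.exists(new_path), os.path.exists(src))
```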
Your task is to implement a MapReduce application that finds the total rainfall in each state, the largest rainfall at one location in each state, and the smallest rainfall at one location in each state.
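The key-value flow for this task can be simulated in plain Python (the '|'-separated input format and the sample values are assumptions; the real application would be written with the Hadoop Java API):

```python
# Simulation of the MapReduce key-value flow for the rainfall task.
# Map input:   lines "location|state|rainfall" (assumed format)
# Map output:  (state, rainfall)
# Reduce out:  (state, (total, largest, smallest))
from collections import defaultdict

lines = ["Loc1|NSW|10.5", "Loc2|NSW|3.0", "Loc3|QLD|7.25"]  # invented sample input

# --- Map stage: emit (state, rainfall) per input line ---
pairs = []
for line in lines:
    _loc, state, rain = line.split("|")
    pairs.append((state, float(rain)))

# --- Shuffle: group the values by key ---
groups = defaultdict(list)
for state, rain in pairs:
    groups[state].append(rain)

# --- Reduce stage: total, largest, smallest per state ---
result = {s: (sum(v), max(v), min(v)) for s, v in groups.items()}
print(result)  # -> {'NSW': (13.5, 10.5, 3.0), 'QLD': (7.25, 7.25, 7.25)}
```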
Deliverables
Sample Questions & Answers
Implementation of MapReduce application
Consider a classical MapReduce application that counts the total number of occurrences of words in a given text. For example, look at the WordCount application available in the file WordCount.java in Laboratory 2.
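The key-value flow of the classical word count can be simulated in a few lines of Python (illustrative only; WordCount.java itself uses the Hadoop Mapper/Reducer classes):

```python
# Map emits (word, 1); the shuffle groups the pairs by word;
# Reduce sums the ones for each word.
from collections import defaultdict

text = "the quick fox and the dog"  # invented sample input

pairs = [(word, 1) for word in text.split()]            # Map stage
groups = defaultdict(list)
for word, one in pairs:                                 # Shuffle
    groups[word].append(one)
counts = {w: sum(ones) for w, ones in groups.items()}   # Reduce stage
print(counts)  # -> {'the': 2, 'quick': 1, 'fox': 1, 'and': 1, 'dog': 1}
```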
Deliverables
Sample Questions:
(1) Build a conceptual model from a provided specification.
A call program is described by a unique name, a price per call, and a short description. A telephone call is described by the phone number of the customer who issued the call (caller), the phone
(2) Translate a conceptual model into a logical model.
Sample conceptual model:
1. Find the total quantities summarized per part and supplier, per part, and the total quantity of all orders.
2. Find an average discount per part and supplier, per part, per supplier, and an average discount of all orders.
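The first query asks for three aggregation levels at once, which is what a ROLLUP (or GROUPING SETS) over (part, supplier) produces. The expected totals can be checked with a plain-Python simulation (the sample orders are invented):

```python
# Totals per (part, supplier), per part, and overall -- the three
# levels a ROLLUP over (part, supplier) would produce.
from collections import defaultdict

orders = [  # (part, supplier, quantity) -- invented sample data
    ("P1", "S1", 10), ("P1", "S2", 5), ("P2", "S1", 7),
]

per_part_supp = defaultdict(int)
per_part = defaultdict(int)
grand_total = 0
for part, supp, qty in orders:
    per_part_supp[(part, supp)] += qty
    per_part[part] += qty
    grand_total += qty
print(dict(per_part_supp), dict(per_part), grand_total)
# -> {('P1', 'S1'): 10, ('P1', 'S2'): 5, ('P2', 'S1'): 7} {'P1': 15, 'P2': 7} 22
```

Query 2 follows the same pattern with averages instead of sums and one extra level (per supplier).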
Present the Scala source code to implement the following queries on invoiceDF: (a) return the total number of rows, (b) return the number of unique stock codes, (c) return the average unit price.
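In Spark these three queries map onto invoiceDF.count(), a countDistinct over the stock-code column, and an avg over the unit-price column. The expected semantics can be checked with a plain-Python stand-in (the StockCode/UnitPrice column names and the sample rows are assumptions):

```python
# Plain-Python stand-in for the three DataFrame queries.
rows = [  # invented sample invoice rows
    {"StockCode": "A1", "UnitPrice": 2.0},
    {"StockCode": "A1", "UnitPrice": 4.0},
    {"StockCode": "B2", "UnitPrice": 3.0},
]

total_rows = len(rows)                                      # invoiceDF.count()
unique_codes = len({r["StockCode"] for r in rows})          # countDistinct("StockCode")
avg_price = sum(r["UnitPrice"] for r in rows) / len(rows)   # avg("UnitPrice")
print(total_rows, unique_codes, avg_price)  # -> 3 2 3.0
```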
Assume that '|' (vertical bar) has been used to separate data items in each row in the data files customer.txt, product.txt, and order.txt.
Assume that the data files customer.txt, product.txt, and order.txt have been uploaded into HDFS.
You need to compute the average number of flights arriving at each country specified in the DEST_COUNTRY_NAME column of flightDF.
Explain your operations and write down your code.
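The operation is a groupBy on DEST_COUNTRY_NAME followed by an average. Its semantics can be checked with a plain-Python stand-in (a numeric "count" column and the sample rows are assumptions about flightDF):

```python
# Stand-in for flightDF.groupBy("DEST_COUNTRY_NAME").avg("count").
from collections import defaultdict

flights = [  # invented sample rows
    {"DEST_COUNTRY_NAME": "Australia", "count": 10},
    {"DEST_COUNTRY_NAME": "Australia", "count": 20},
    {"DEST_COUNTRY_NAME": "Japan", "count": 5},
]

sums, ns = defaultdict(float), defaultdict(int)
for row in flights:
    sums[row["DEST_COUNTRY_NAME"]] += row["count"]
    ns[row["DEST_COUNTRY_NAME"]] += 1
averages = {c: sums[c] / ns[c] for c in sums}
print(averages)  # -> {'Australia': 15.0, 'Japan': 5.0}
```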
(4) Explain how you compile the source code of a self-contained application and submit it to a Spark cluster. Support your answer with Terminal commands.
(2) Compute an average applied discount (DISCOUNT) for the parts with the keys (PARTKEY) 1001 and 1002 and list the part keys and the average applied discount.
(3) For each part, list its key (PARTKEY), all its applied discounts, and an average applied discount (DISCOUNT) of the current discount and the previous one in the ascending order of applied discounts.
(2) Write HBase shell commands that implement the following queries.
(i) Find all information about the accidents having damages higher than 1000. (ii) List the first and the last name of people involved in accidents in Sydney in 2019.
You do not need to rewrite the entire application. However, if Java code is more convenient for you than written English, then you are free to write Java code.
Sample Questions:
Read and analyse the specification of a data warehouse domain listed below.
Employees are described by an employee number (unique), a first name, and a last name. Mechanics are described by a licence number (unique); junior mechanics are described by the titles of courses completed.
Facilities are described by an address in a city, and cities are described by a name (unique).
(1) Use the HBase shell command language to write the commands that create an HBase table implementing the two-dimensional data cube given above.
(2) Write the commands of the HBase shell command language that insert into the HBase table created in the previous step information about at least 2 orders submitted by the same customer and including two different products.
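Conceptually, an HBase table is a sorted map from a row key to column families to qualifiers to values, and the shell's put command writes one such cell at a time. The shape of the data for step (2) can be sketched in Python (the customer#order row-key design, family name, qualifiers, and sample values are all assumptions, not part of the task):

```python
# HBase-style storage sketch: row key -> column family -> qualifier -> value.
# Row keys combine customer and order, mirroring a composite-key design.
table = {}

def put(row, family, qualifier, value):
    """Analogue of the HBase shell: put 'table', row, 'family:qualifier', value"""
    table.setdefault(row, {}).setdefault(family, {})[qualifier] = value

# two orders by the same customer, including two different products
put("cust1#order1", "order", "product", "bolt")
put("cust1#order1", "order", "quantity", "10")
put("cust1#order2", "order", "product", "nut")
put("cust1#order2", "order", "quantity", "4")
print(table["cust1#order1"]["order"]["product"])  # -> bolt
```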
Assume that the speed limit at the location of the speed camera is 60 kilometres per hour.
Your task is to explain how to implement a MapReduce application that finds the average speed of all cars that exceeded the speed limit at the location of the speed camera. You must specify the parameters (if any) of your application and the key-value data in the input and output of the Map and Reduce stages.
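One workable key-value design: the Map stage filters out readings at or below the limit and emits every speeding reading under a single constant key, so that one Reduce call sees all of them and can average. A plain-Python simulation of that flow (the sample readings are invented):

```python
# Map emits ("speeding", speed) only for speeds above the limit;
# the single shared key routes all values to one Reduce call.
SPEED_LIMIT = 60.0  # km/h, as stated in the task

readings = [55.0, 72.0, 61.0, 80.0]  # invented camera readings

pairs = [("speeding", s) for s in readings if s > SPEED_LIMIT]  # Map stage
values = [s for _, s in pairs]                                  # Shuffle (one key)
average = sum(values) / len(values)                             # Reduce stage
print(average)  # -> 71.0
```

Using a single key serializes the Reduce work, which is acceptable here because one camera produces a modest data volume; a combiner emitting partial (sum, count) pairs would be the usual refinement.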
Write Pig Latin statements that perform the following operations. For (5.2) and (5.3), also present the correct output.
(5.1) Load the datasets using the provided relation names and field names. The fields of each relation must have suitable types.
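In Pig, loading a '|'-separated file with typed fields looks like LOAD 'customer.txt' USING PigStorage('|') AS (id:int, name:chararray). What such a statement does to each row can be checked with a small Python stand-in (the field names, types, and sample rows are assumptions about the data sets):

```python
# Stand-in for Pig's
#   customers = LOAD 'customer.txt' USING PigStorage('|')
#               AS (id:int, name:chararray);
lines = ["1|Alice", "2|Bob"]  # invented '|'-separated rows

customers = []
for line in lines:
    id_str, name = line.split("|")   # PigStorage('|') splits on the bar
    customers.append((int(id_str), name))  # apply the declared field types
print(customers)  # -> [(1, 'Alice'), (2, 'Bob')]
```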