Sql Assignment Question

IE 332 - Homework #2

Read Carefully. Important!

As outlined in the course syllabus this homework is worth 7% of your final grade. The maximum attainable mark on this homework is 140. As was also outlined in the syllabus, there is a zero tolerance policy for any form of academic misconduct. The assignment can be done individually or in pairs.

By electronically uploading this assignment to Blackboard you acknowledge these statements and accept any repercussions if in any violation of ANY Purdue Academic Misconduct policies. You must upload your homework on time for it to be graded. No late assignments will be accepted. Only the last uploaded version of your assignment will be graded.

NOTE: You should aim to submit no later than 30 minutes before the deadline, as there could be last minute network traffic that would cause your assignment to be late.

When submitting your assignment it is assumed that every student considers the below checklist, as there are grading consequences otherwise (e.g., not submitting a cover sheet is an automatic grade of ZERO).

Attach a cover sheet (see Blackboard) as the first page of your submission.

Submit page i of this assignment as the second page of your submission.

Your solutions have style (see Q1 in the assignment).

All of your solutions (program code, etc.) are included in the submission.

All of your source code is included as requested (.R, etc.).

You have not included any screen shots, photos, etc. (plots from R should be intermediately saved as .png files).

All math notation and expressions are created using an equation editor (no pictures, handwritten solutions, etc.).

If using Word or other text processor, convert the output to .pdf and ensure that it does not contain any aesthetic or other errors. Submit only the pdf version.

If using LATEX, the source code is separately submitted in a .zip file. If using LATEX correctly, there is a 7 point bonus. Failure to submit the source code voids potential bonus.

If submitting with a partner, BOTH must have uploaded the SAME assignment by the due date. Watch videos on creating pseudocode if you need a refresher or quick reference to the idea. These are good starter videos:

(0 points) These are style points meant to enforce the skill of communicating technical information in a precise, concise and easily interpretable way. You are penalized for (a) using poor grammar/spelling, (b) disorganized presentation of solutions (including organizing your code into functions, as appropriate), (c) not commenting well your source code, (d) not using meaningful variable names in your code. At the discretion of the TA (who should be grading this “hard”). The presumption is that you do not do any of these things, and so doing them will cost addition points (up to -10). Your goal is to get 0/0 on this question.

The assignment should have margins between 0.5 and 0.75 inches wide, with font size no larger than 12pt (11pt for code) and no smaller than 10pt - sub/superscripts, figure and plot captions excluded, but should be clearly legible. Clearly label each question.

If a question requires more than 40% of the page to answer, then that is the only answer on that page. If multiple pages are required, this rule applies to the last of those pages.

Consider a distributed media sharing system. In this system, users store music, pictures and videosfiles, where each such file will have a unique ID (f id). Additionally:

  • A user has user name, e-mail address, phone number, and user ID. When user accesses a file, the system creates and stores an associates time stamp with each access.
  • A song is written by an artist, included in an album, has a title, and size (in MB). An album has a title, release year and an ID. Artists have a name and an ID (art ID).
  • A picture has the date and location when/where it was taken, a file size (MB), and type (jpg, png, etc.). Additionally, the system records who is in the picture.
  • A video has the date and location when/where it was taken, length (seconds), and file size (MB). The system also records the people who are in the video.

Each file belongs to only one user. A user can have many files and access those files for different reasons: upload, view or convert. The system will track the time and date of users’ accesses. An artist can have more than one song and album, but can only publish one album with the same title in a year. However, multiple artists can publish an album with the same title in the same year. An album will have at least one song, and a song can be converted to a video by adding some pictures to it.

The list of people who appear in a picture or a video is created automatically by external software, but in the database they are each assigned an ID and a name (user name) (if found in the system). The system allow users to create photos from a video by clipping and music by subtracting visual elements from a video.

Users can interact with each other by adding them as their friend. In this case, users can access their friends’ files. For each such friendship, the system keeps track of the date the friendship started.

  1. (13 points) Create a representative ER Diagram using Chen notation. Be sure to clearly andappropriately indicate any keys, weak entities, relationship constraints, etc.
  2. (13 points) Create a representative ER Diagram using Crow’s Foot notation. Be sure to clearlyand appropriately indicate any keys, weak entities, relationship constraints, etc.
  3. Consider the following schema of a simplified shipment system of an e-commerce firm, where theunderlined values indicate the primary key:

Products(ProductID:int, CategoryID:int, CategoryName:varchar(50),

ProductName:varchar(50), Price:real, Discounts:real, Quantity:int, Views:int, Likes:int)

Each product has a unique ID, belongs to only one category (identified by ID and name), has name,price, discounts, quantity, views and likes.

Customers(CustomerEmail:varchar(100), CustomerName:varchar(100), Phone:varchar(50),

Address:varchar(100), Zipcode:varchar(10), City:varchar(50), State:varchar(2))

Each costumer is identified by their email address, has an address (with zipcode, city and state).

Orders(OrderID:Integer, OrderDate:date, CustomerEmail:varchar(100), ProductID:int,

Quantity:int, Tax:real, OrderTotal:real, Status:varchar(50),

PaymentMethod:varchar(50))

Each order has a unique ID, a date, a customer (identified by email address), a product (and itsquantity), an order total (excluding tax), an amount of tax charged on the order, a status (InProgress, Shipped, Delivered), and a payment method (Credit Card, Paypal). If there are multiple products in the same order, there will be a row for this order ID for each of the products ordered (where order total/tax remain the same in all of those rows).

Shipments(TrackingNumber:varchar(50), OrderID:varchar(50), ProductID:int,

Type:varchar(50), Carrier:varchar(50), ShippingDate:date)

When (a subset of) the products in an order are shipped, a shipment tracking number is issued.Each shipment has a shipping method (Regular, Express), is shipped using a carrier (Fedex, USPS, UPS) and has a shipping date. Note that products in the same order can be shipped in different shipments. If there are multiple products in the same shipment, there will be a row for this tracking number and order ID for each of the products ordered (where type, carrier and shipping date remain the same in all of those rows).

(15 points) Provide all the CREATE TABLE statements needed to create the database.

Provide SQL statements to answer the following questions from the database above:

  • (2 points) How many products does the company sell?
  • (2 points) Provide a list of the distinct states where the customers are located at.
  • (5 points) List how many orders each carrier delivered in the year of 2018 where the ordertotal was greater than $1000.
  • (5 points) List the email address of the customers that made more than 100 orders in 2018,with the number of orders that each of these customers made. Order the list from highest number of orders to lowest number of orders.
  • (5 points) List the name, ID and quantity sold of the 10 products that sold the most in thecategory Furniture in 2018 that have “bed” in the name. Order in decreasing order of the quantity sold and break ties in alphabetical order by product name.
  1. (5 points) Does the schema violate any of the Normal Form (NF) rules? If yes, indicate thewhere a rule was violated, explaining which of the NF rules was violated and why, and how it could be fixed.
  2. Consider the schema from Q3. For each English statement below indicate whether the associatedquery is correct. If it is incorrect provide a fix to the query. If it is correct, simplify the query to achieve the same result but with less complicated SQL code.
  3. (5 points) List the email addresses of the customers who bought either the “Modern diningtable” or the “Modern dining chair”.

SELECT c.CustomerEmail

FROM Customers c

LEFT JOIN Orders o ON c.CustomerEmail=o.CustomerEmail

LEFT JOIN Products p ON o.ProductID=p.ProductID

WHERE ProductName="Modern dining table" AND ProductName="Modern dining chair";

(5 points) List the ID of the orders and the number of shipments in that order for all the ordersthat were shipped in more than one shipment. Order the list from highest number of shipments to lowest number of shipments.

SELECT o.OrderID, Count(o.OrderID)

FROM Orders o

INNER JOIN Shipments s ON o.OrderID=s.OrderID

GROUP BY o.OrderID

ORDER BY o.OrderID DESC;

(5 points) What number of ordered products were liked by more than 150 customers.

COUNT(p.ProductID) FROM Customer c INNER JOIN Orders o

ON c.CustomerEmail = o.CustomerEmail INNER JOIN Products p ON p.ProductID = o.ProductID WHERE COUNT(p.Likes) > 150.

(5 points) Provide ProductIDs that have been ordered more than 100 times, have more than a$20 discount, and was shipped by USPS.

SELECT p.ProductID FROM Products p INNER JOIN Shipments s

ON s.ProductID = p.ProductID WHERE EXISTS

(SELECT o.ProductID FROM Orders o HAVING COUNT(o.ProductID) > 100) AND p.Discount > $20 AND s.Carrier = USPS

(5 points) Select the Products that have more than 100 Views or more than 20 Likes. Sort theProductName by the total Quantity ordered.

SELECT ProductID, ProductName, SUM(Quantity)

FROM Products, Orders

WHERE Products.ProductID = Orders.ProductID

ORDER BY SUM(Quantity)

GROUP BY ProductID

For this question you will compare the performance of Naive Bayes to Decision Trees for classifyingsoccer players (csv posted on Blackboard). The overall rating column is an accumulation of all other fields for each player. All other columns are potential predictors for the value of overall rating. We will choose certain predictors to learn a classifier for the overall rating column. Be sure to first install the RMySQL, e1071, tree, and ROCR packages.

(2 points) The first 4 lines of the pa.csv file are as follows.

Please skip this line

player api id

date

overall rating

potential

preferred foot

###The previous line is the header

505942

2/18/2016 12:00:00 AM

67

71

right

Table 1: First 4 lines of the player.csv file

Please use the function read.csv and its arguments skip and comment.char to read the data properly into a data frame in R called raw.df. You only need one line of R code.

(2 points) Remove the all rows which contain NA values in the any column of the df dataframe and name the new dataframe as nona.df. Only one line of R code.

(3 points) Remove the rows where the value of the overall rating column lie outside of 3IQR (Interquartile Range), where IQR is the difference between 75th and 25th quartiles, you could use the stats function and its argument coef in R. Name the new dataframe after removing the outliers as nonout.df. Only 2 lines of R code needed.

(5 points) Compute the correlation between field overall rating and all the other numeric Select four highest correlation and store the corresponding columns combined with player api id and overall rating into a new data frame called highcor(structure of highcor is as the following). No more than 10 lines of R code.

player api idcol1col2col3col4overall rating

(4 points) Write the highcor data frame into a SQL table called highcor using RMySQL. The highcor table contains multiple records for the same player (examples are as follows). Please compute the grouped mean for each predictor column grouped by player api id. Please use the GROUP BY query combined with dbGetQuery function in RMySQL to store the result in a dataframe called pa in R. When submitting your code you can cut out any CAREER account passwords. No more than 4 lines of R code.

player api id

reactions

potential

short passing

ball control

overall rating

505942

47

71

61

49

67

505942

47

71

61

49

67

505942

47

66

61

49

62

155782

67

76

71

71

74

155782

67

76

71

71

74

(3 points) Use the mean of overall rating column in unique.pa as the threshold to create a 2-level factor column for this column with labels 1 and 0. Then create the final dataframe for classification with the four predictors and this factor column. No more than 3 lines of R code.

For the below classifiers, your training set will contain a random selection of 75% of the data, the testing set will be composed of the remaining 25%.

(5 points) Use the naiveBayes function in the e1071 package, and learn a classifier to determine the corresponding player’s rating. You may want to caret package to create a data partition, but can also do so using an alternative approach. No more than 9 lines of R code.

(3 points) Compute the correlation matrix between all four predictors, and plot the correlationvalues using a heatmap (can use ggplot2 and reshape libraries if you want). Based on the correlation matrix, evaluate the caveats to use Naive Bayes on these predictors. No more than 5 lines of R code needed.

(2 points) Use the tree function in the tree package, and learn a classifier to determine the corresponding player’s rating. No more than 2 lines of R code needed.

(8 points) Report the Confusion matrix and ROC curve(use ROCR package in R) for both classifiers. No more than

Watch https://www.youtube.com/watch?v=nOnsbd7rdhc and answer the following questions.

  1. (1 point) Technically speaking, what is Siri?
  2. (1 point) Has HAL-9000 actually been created by NASA?
  3. (1 point) What is Alexa limited by?
  4. (1 point) How many ultrasonic sensors are on a Tesla?
  5. (1 point) Amazon is an example of what kinda of AI?
  6. (1 point) What video game inspired the name of Microsoft’s Cortana?
  7. (1 point) How many musical characteristics does Pandora break a song into?

Watch https://www.youtube.com/watch?v=fsxKUby4EEw and answer the following questions.

  1. (1 point) How many years are we away from AI being sentient?
  2. (1 point) What may be AI’s greatest promise?
  3. (1 point) What is an example of a healthcare application of AI?
  4. (1 point) What are the classes of data for the agriculture application?
  5. (1 point) How many images of weeds have been gathered?
  6. (1 point) Give an example of robots in a retail space.