MBAN 6110 Data Science Homework 1

Q1: Choose a problem from a past job, hobby, or interest that would make for a good predictive modeling classification application. Describe it in one page or less, using the relevant concepts introduced in classes 1 & 2 and Ch. 1 – 3 in the book. Your description should be as complete and precise as possible, referring to the concepts introduced in class and in the book. Please do not choose one of the applications we have already discussed in detail (churn, targeted marketing, credit scoring).

Include answers to the following:

What exactly is the business decision you want to support with this solution?
Describe the use phase.
Why did you select this as a good predictive modeling problem?
How and where would you get the data?
Explain precisely why and how you expect doing the predictive modeling will add value.
What exactly is the quantity that you inherently do not know and need to predict?
Is this a classification, ranking, or probability estimation problem?
What are the features? Provide a list of at least 5 features that you think (a) you can get and (b) you think might be useful.
What exactly would be your training data?

Schulich School of Business MBAN 6110 Data Science I Prof. Michael Chen

Q2: Give your own definition and description of each of the following problems. You may consult the textbook, the Internet, or other resources. Your answer to each question must not exceed one page.

What is the customer churn problem?
What is firmographic data, and how is it related to data mining?
What is a market basket problem?
Can you describe the online recommendation system?
What is the link prediction problem for social networks? Can you give examples?
What is A/B testing? Describe its use in the online advertising setting.
What does the term “customer profiling” mean?
What is the placebo effect in data mining?
What is the optical character recognition (OCR) problem? What are the major techniques involved?

Q3: Use Weka to conduct tree segmentation on the provided data churn.arff, and report the following:

Describe your understanding of the dataset in words, in no more than one page.
Describe what you find by eyeballing the data.
Run J48 with all default settings, and select “Use training set” as the test option. What is the fitting accuracy?
Try different values of “minNumObj” to obtain a more interpretable decision tree, and describe your interpretation of the final decision tree.
Prepare a file “ReducedChurn.arff” with the following attributes: COLLEGE, INCOME, LEFTOVER, and LEAVE, and only the first 50 records. Include this file in Appendix A of your homework report.
Conduct the tree segmentation on the reduced dataset using Excel. Organize your Excel sheet so that it prints neatly on paper. Clearly show and label each step of your calculation. Include the printout in Appendix B of your report. What is the accuracy of your decision tree on the reduced dataset?
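
The Excel calculation in the last item can be cross-checked with a few lines of code. The sketch below assumes the segmentation is based on entropy and information gain (the assignment does not name the splitting measure, so treat this as an assumption). The function names and the toy records are made up for illustration and are not taken from churn.arff.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on `attribute`."""
    parent = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    children = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - children

# Made-up toy records for illustration only (NOT from churn.arff).
toy_rows = [
    {"COLLEGE": "one", "LEAVE": "LEAVE"},
    {"COLLEGE": "one", "LEAVE": "LEAVE"},
    {"COLLEGE": "zero", "LEAVE": "STAY"},
    {"COLLEGE": "zero", "LEAVE": "STAY"},
]
gain = information_gain(toy_rows, "COLLEGE", "LEAVE")
```

A numeric attribute such as INCOME or LEFTOVER would first need to be turned into a two-valued attribute by choosing a threshold (for example, above or below some cut-off) before the same function applies.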

Q4: Use Weka SimpleLogistic to analyze the churn.arff data, and report:

a) What is the overall accuracy?
b) Write down the linear discriminant equation, using the attribute names as the variable names.
c) For a customer with the following attributes, calculate your prediction using the linear discriminant equation: zero,28795,0,0,381539,284,0,12,very_unsat,very_little,no
d) Remove the first 10% of the records from churn.arff, and repeat analyses a), b), and c). Describe how the linear discriminant equation changes from the one you obtained in b).
e) Now conduct analyses a) and b) on the ReducedChurn.arff file you prepared in the previous question.
f) Remove the first 10% of the records from ReducedChurn.arff, and repeat analyses a) and b). Describe how the linear discriminant equation changes, and compare this change with the change in d).
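
Evaluating a linear discriminant for one customer amounts to an intercept plus a weighted sum of the attribute values. The sketch below shows the mechanics only. All coefficients are hypothetical placeholders, not Weka output, and only three attributes are included. It also assumes that the first two values of the customer record are COLLEGE and INCOME and that LEFTOVER is 0, which should be checked against the attribute order in churn.arff.

```python
# Hypothetical coefficients for illustration only; replace them with the
# values reported in your own Weka output.
INTERCEPT = 0.5
NUMERIC_WEIGHTS = {"INCOME": 0.00001, "LEFTOVER": -0.02}
# A nominal attribute enters as an indicator term: its weight is added only
# when the attribute takes the listed value.
INDICATOR_WEIGHTS = {("COLLEGE", "zero"): -0.3}

def discriminant(record):
    """Return f(x) = intercept + sum of (weight * value) over all terms."""
    score = INTERCEPT
    for name, weight in NUMERIC_WEIGHTS.items():
        score += weight * record[name]
    for (name, value), weight in INDICATOR_WEIGHTS.items():
        if record[name] == value:
            score += weight
    return score

# Partial customer record (assumed attribute positions, see the note above).
customer = {"COLLEGE": "zero", "INCOME": 28795, "LEFTOVER": 0}
score = discriminant(customer)
predicted_side = "positive" if score > 0 else "non-positive"
```

The sign of the score decides the predicted class. Which class corresponds to the positive side depends on how Weka orders the class values, so confirm this in the model output before reporting a prediction.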

Q5: Use Weka libSVM to analyze the churn.arff data, and report:

What is the overall accuracy with the default settings?
For a customer with the following attributes, calculate your prediction using Weka: zero,28795,0,0,381539,284,0,12,very_unsat,very_little,no. Note that you need to prepare an .arff test file for this purpose.
(Bonus question) Write down the discriminant equation, using the attribute names as the variable names. You will need to read the auxiliary materials posted on the course webpage to answer this question.
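
One possible way to prepare the .arff test file is to reuse the header of the training file, so that the attribute declarations match exactly, and append the single customer record with "?" (the ARFF missing-value marker) in place of the unknown class. The sketch below assumes the class attribute is the last one. The function name and the tiny header are made up for illustration and are not the real churn.arff header.

```python
def build_test_arff(training_arff_text, feature_values):
    """Copy the header through the @data line, then append one unlabeled record."""
    header_lines = []
    for line in training_arff_text.splitlines():
        header_lines.append(line)
        if line.strip().lower() == "@data":
            break
    else:
        raise ValueError("no @data line found in the training file text")
    # "?" marks the class value as unknown (class assumed to be the last attribute).
    row = ",".join(str(v) for v in feature_values) + ",?"
    return "\n".join(header_lines + [row]) + "\n"

# Made-up miniature file for illustration only (NOT the real churn.arff).
tiny_training = """@relation toy
@attribute COLLEGE {zero,one}
@attribute INCOME numeric
@attribute LEAVE {STAY,LEAVE}
@data
one,50000,STAY
"""
test_text = build_test_arff(tiny_training, ["zero", 28795])
```

For the real file, read churn.arff as text, pass all eleven feature values of the customer in the same order as the attribute declarations, and save the result under a new file name to supply as the test set in Weka.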