Question:
Approximately 1.65 million high school students take the Scholastic Aptitude Test (SAT) each year, and nearly 80 percent of the colleges and universities without open admission policies use SAT scores in making admission decisions. The current version of the SAT includes three parts: reading comprehension, mathematics, and writing.
A perfect combined score for all three parts is 2400. A sample of SAT scores for the combined three-part SAT is as follows:
1665 | 1525 | 1355 | 1645 | 1780 |
1275 | 2135 | 1280 | 1060 | 1585 |
1650 | 1560 | 1150 | 1485 | 1990 |
1590 | 1880 | 1420 | 1755 | 1375 |
1475 | 1680 | 1440 | 1260 | 1730 |
1490 | 1560 | 940 | 1390 | 1175 |
- Show a frequency distribution and histogram. Begin with the first bin starting at 800, and use a bin width of 200.
- Comment on the shape of the distribution.
- What other observations can be made about the SAT scores based on the tabular and graphical summaries?
(2+3+5 = 10 marks)
The following 20 observations are for two quantitative variables, x and y.
Observation | x | y | Observation | x | y |
1 | -22 | 22 | 11 | -37 | 48 |
2 | -32 | 49 | 12 | 34 | -29 |
3 | 2 | 8 | 13 | 9 | -18 |
4 | 29 | -16 | 14 | -33 | 31 |
5 | -13 | 10 | 15 | 20 | -16 |
6 | 21 | -28 | 16 | -3 | 14 |
7 | -13 | 27 | 17 | -15 | 18 |
8 | -23 | 35 | 18 | 12 | 17 |
9 | 14 | -5 | 19 | -20 | -11 |
10 | 3 | -3 | 20 | -7 | -22 |
- Create a scatter chart for these 20 observations.
- Fit a linear trendline to the 20 observations. What can you say about the relationship between the two quantitative variables?
Use the DemoKTC data file to conduct the following analysis:
- Use k-means clustering with a value of k = 3 to cluster based on the Age, Income, and Children variables.
- Repeat the k-means clustering for values of k = 2, 4, 5.
- How many clusters do you recommend? Why?
(8+ 6+6 = 20 marks)
A sociologist was hired by a large city hospital to investigate the relationship between the number of unauthorized days that employees are absent per year, the distance (miles) between home and work for the employees, and the employees’ level of years employed. A sample of 10 employees was chosen, and the following data were collected.
Distance to work (miles) | No. of Days Absent | Level years of employed |
1 | 8 | Long-Career |
3 | 5 | Mid-Career |
4 | 8 | Long-Career |
6 | 7 | Long-Career |
8 | 6 | Mid-Career |
10 | 3 | Mid-Career |
12 | 5 | new |
14 | 2 | new |
14 | 4 | Mid-Career |
18 | 2 | new |
- Develop a scatter chart for these data (consider Distance to work as independent variable and number of days absent as dependent variable). Does a linear relationship appear reasonable? Explain.
- Use the data to develop an estimated regression equation that could be used to predict the number of days absent given the distance to work and level years of employed. What is the estimated regression model?
- How much of the variation in the sample values of number of days absent does the model you estimated in part (b) explain?
Answer:
Dixie Showtime Movie Theaters, Inc., owns and operates a chain of cinemas in several markets in the southern United States. The owners would like to estimate weekly gross revenue as a function of advertising expenditures. Data for a sample of eight markets for a recent week follow:
Market | Weekly Gross Revenue ($100s) | Television Advertising ($100s) | Newspaper Advertising ($100s) |
Mobile | 101.3 | 5.5 | 1.5 |
Shreveport | 51.9 | 3 | 3 |
Jackson | 74.8 | 4.4 | 2.5 |
Birmingham | 126.2 | 4.3 | 4.3 |
Little Rock | 137.8 | 3.6 | 4.1 |
Biloxi | 101.4 | 3.2 | 2.3 |
New Orleans | 237.8 | 5 | 8.4 |
Baton Rouge | 219.6 | 6.9 | 5.8 |
- Develop an estimated regression equation with the amount of television advertising as the independent variable. How much of the variation in the sample values of weekly gross revenue does the model explain?
- Develop an estimated regression equation with both television advertising and newspaper advertising as the independent variables. How much of the variation in the sample values of weekly gross revenue does the model explain?
(5+5 = 10 marks)
An internet provider company in Australia is interested in identifying why some individuals are still undecided about buying the company’s new NBN service. The file NBN-service contains data on a sample of customers with tracked variables.
Create a standard partition of the data with all tracked variables and 40% of observations in the training set, 35% in the validation set, and 25% in the test set. Fit a single classification tree using contract duration (month), last plan, bonus data (GB), usage (GB), regular payment, have modem, and unlimited service as input variables and undecided as the output variable. In step 2 of XLMiner’s classification tree procedure, be sure to Normalize Input Data and to set the Minimum #records in a terminal node to 250. Generate the full tree and best pruned tree. Please note that for this question you need to use a cutoff value of 0.4.
- From the CT-Output worksheet, what is the overall error rate of the full tree on the validation set?
- Consider a 35-month contract customer who has selected plan 10 as his last plan, gifted bonus data of 50 GB, with usage of 137 GB, with regular payment, owns the modem and without unlimited service. Using the CT_PruneTree worksheet, does the best-pruned tree classify this observation as undecided?
- For the default cut-off value of 0.5, what are the overall error rate, class 1 error rate, and class 0 error rate of the best-pruned tree on the test set?
(5+10+5=20 marks)
- (Answer this question or Question 8; you must not answer both. If you answer this question, I will not mark Question 8 for you.)
The University of Cincinnati Center for Business Analytics is an outreach center that collaborates with industry partners on applied research and continuing education in business analytics. One of the programs offered by the center is a quarterly Business Intelligence Symposium. Each symposium features three speakers on the real-world use of analytics. Each corporate member of the center (there are currently 10) receives five free seats to each symposium. Nonmembers wishing to attend must pay $75 per person. Each attendee receives breakfast, lunch, and free parking. The following are the costs incurred for putting on this event:
Rental cost for the auditorium | $150 |
Registration processing | $8.50 per person |
Speaker costs | 3@$800 = $2400 |
Continental breakfast | $4.00 per person |
Lunch | $7.00 per person |
Parking | $5.00 per person |
- Build a spreadsheet model that calculates a profit or loss based on the number of nonmember registrants.
- Use Goal Seek to find the number of nonmember registrants that will make the event break even.
(7+8=15 marks)
- (Answer this question or Question 7; you must not answer both. If you have answered Question 7, I will not mark this question.)
Great Southern is a 117-room hotel located near the Convention and Exhibition Centre in Melbourne. The Meetings & Events Australia Ltd has planned its annual exhibition in Melbourne for the last weekend in April. Great Southern has agreed to make at least one-third of its rooms available for exhibition attendees at a special exhibition rate in order to be listed as a recommended hotel for the exhibition. Although the majority of attendees at the annual exhibition typically request a Saturday night package, some of them may choose a Friday and Saturday two-night package reservation. Travellers not attending the exhibition may also request the same day reservations as well as Friday night only one. Thus, 5 kinds of reservations are likely and have the following costs and expected demands:
Reservation type | Ordinary (O) Cost ($) | Ordinary (O) Demand | Exhibition (E) Cost ($) | Exhibition (E) Demand |
Friday night only (F) | 79 | 38 | 72 | |
Saturday night only (S) | 84 | 36 | 78 | 37 |
Friday & Saturday Two-night package(FS) | 165 | 32 | 139 | 45 |
- How many rooms should Great Southern make available for each type of reservation in order to maximise total profit?
- Suppose that one week before the exhibition, the Exhibition (E) Friday & Saturday two-night package rooms that were made available sell out. If an exhibition attendee calls and requests a Friday & Saturday two-night package room, should Great Southern accept this booking? If an ordinary traveller requests this option at the same time, which booking would be better to accept?
Question 1
- Frequency distribution and Histogram for sampled SAT scores:
Frequency Distribution Table | |
Class Intervals (marks) | Frequency |
800-1000 | 1 |
1000-1200 | 3 |
1200-1400 | 6 |
1400-1600 | 10 |
1600-1800 | 7 |
1800-2000 | 2 |
2000-2200 | 1 |
2200-2400 | 0 |
Total | 30 |
- As observed from the plot above, the distribution is roughly bell-shaped but not perfectly symmetric; it is slightly skewed to the right (i.e. has positive skewness).
- Additional comments:
- Only a small number of students (3 out of 30, i.e. 10% of the sample) had an SAT score above 1800.
- The largest share of students (10 out of 30, about 33.33%) had an SAT score within the range of 1400-1600.
- The most common SAT score range was 1200-1800, with about ¾ of the sampled students (23 out of 30 students) having a score within this range.
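The frequency table for Question 1 can be reproduced with a short script. This is a minimal sketch, not the spreadsheet workings themselves: the function name `frequency_distribution` and the rule that each bin includes its lower edge are assumptions. No score in this sample falls exactly on a bin edge, so the edge convention does not change the counts.

```python
# SAT scores from the sample in Question 1.
scores = [
    1665, 1525, 1355, 1645, 1780,
    1275, 2135, 1280, 1060, 1585,
    1650, 1560, 1150, 1485, 1990,
    1590, 1880, 1420, 1755, 1375,
    1475, 1680, 1440, 1260, 1730,
    1490, 1560, 940, 1390, 1175,
]

START, WIDTH, N_BINS = 800, 200, 8  # bins 800-1000, 1000-1200, ..., 2200-2400


def frequency_distribution(values, start=START, width=WIDTH, n_bins=N_BINS):
    """Count values per bin (hypothetical helper).

    Each bin includes its lower edge; the last bin also includes
    the upper limit, so a perfect score of 2400 is still counted.
    """
    counts = [0] * n_bins
    for v in values:
        index = min((v - start) // width, n_bins - 1)
        counts[index] += 1
    return counts


counts = frequency_distribution(scores)

# Text histogram: one asterisk per student in the bin.
for i, c in enumerate(counts):
    low = START + i * WIDTH
    print(f"{low}-{low + WIDTH}: {c:2d} {'*' * c}")
```

The printed bars give a rough text version of the histogram, which is enough to check the shape comments against the counts.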
Question 2
- Scatter Plot:
- The trend line (also called the line of best fit or the least-squares regression line) is plotted in the above plot (in black).
The regression equation is given as:
y = 3.087 - 0.936x
As observed, a negative relationship exists between the two variables (x and y), i.e. variable ‘y’ decreases with increase in variable ‘x’, and vice-versa.
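The trendline for Question 2 can be checked outside the spreadsheet. The sketch below applies the ordinary least-squares formulas for a single predictor to the 20 observations; the function name `fit_line` is an assumption for illustration.

```python
# The 20 (x, y) observations from Question 2, in observation order.
xs = [-22, -32, 2, 29, -13, 21, -13, -23, 14, 3,
      -37, 34, 9, -33, 20, -3, -15, 12, -20, -7]
ys = [22, 49, 8, -16, 10, -28, 27, 35, -5, -3,
      48, -29, -18, 31, -16, 14, 18, 17, -11, -22]


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squares of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(xs, ys)
print(f"y_hat = {intercept:.3f} + ({slope:.3f}) * x")
```

A negative slope confirms that y tends to fall as x rises.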
Question 3
- Please refer to the Excel Spreadsheet ‘Question 3.xlsx’
- Please refer to the Excel Spreadsheet ‘Question 3.xlsx’
- An appropriate number of clusters can be estimated as the square root of half the sample size (the total number of data points), i.e. k ≈ √(n/2).
Sample size: n = 30
Therefore, k ≈ √(30/2) = √15 ≈ 3.87
Hence, for the given data, k = 4 clusters is preferred. This value scales with the number of data points and keeps the clustering error low without splitting the data into too many groups.
Note: More precise measures can be obtained using the Elbow Method.
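The rule of thumb used in Question 3 is a one-line calculation. In the sketch below, the function name `rule_of_thumb_k` is hypothetical, and the sample size of 30 is an assumption taken from the 30-row data table listed with the workings.

```python
import math


def rule_of_thumb_k(n_points):
    """Suggest a cluster count as the rounded square root of n / 2."""
    return round(math.sqrt(n_points / 2))


n = 30  # assumed sample size: rows in the Age/Income/Children data table
print(f"sqrt({n}/2) = {math.sqrt(n / 2):.2f}, suggested k = {rule_of_thumb_k(n)}")
```

This only gives a starting point; comparing the cluster outputs for k = 2, 3, 4, 5 is still needed to justify the final choice.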
Question 4
- Scatter Plot:
The above plot suggests that a negative linear relationship exists between the two variables: the number of days absent decreases as the distance to work increases. However, this relationship seems counter-intuitive since, in general, an employee travelling a greater distance to work would be expected to be absent more often.
- Note: For the regression computations, the categorical variable ‘Level years of employed’ was coded numerically according to the order of the levels. The following values are assigned:
New: 1
Mid-career: 2
Long-career : 3
The following regression equation is fitted to predict the number of days an employee is absent from the distance (in miles) travelled to work and the coded level years of employed:
Days Absent = 4.1429 - 0.1905 (Distance to work) + 1.2857 (Level years of employed)
The overall model is statistically significant at a significance level of 0.05, i.e. for α = 0.05. However, the individual slope coefficients are not significant in this model (each p-value is greater than 0.05).
- As evident from the R² (coefficient of determination) value, the above regression model explains about 78.47% of the variation in the dependent variable ‘number of days absent’ using the predictor variables ‘distance to work’ and ‘level years of employed’.
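The Question 4 model can be reproduced from the ten data rows using the coding New = 1, Mid-career = 2, Long-career = 3 stated in the note. This is a sketch of the two-predictor least-squares solution in closed form; the helper names `centered_cross` and `fit_two_predictors` are assumptions, and the spreadsheet regression tool remains the source of the reported figures.

```python
distance = [1, 3, 4, 6, 8, 10, 12, 14, 14, 18]
days_absent = [8, 5, 8, 7, 6, 3, 5, 2, 4, 2]
level = ["Long-Career", "Mid-Career", "Long-Career", "Long-Career", "Mid-Career",
         "Mid-Career", "new", "new", "Mid-Career", "new"]

# Ordinal coding taken from the note in the answer.
CODES = {"new": 1, "Mid-Career": 2, "Long-Career": 3}
level_code = [CODES[v] for v in level]


def centered_cross(u, v):
    """Sum of products of deviations from the means of u and v."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))


def fit_two_predictors(x1, x2, y):
    """Least squares for y = b0 + b1*x1 + b2*x2; returns (b0, b1, b2, R^2)."""
    s11, s22, s12 = centered_cross(x1, x1), centered_cross(x2, x2), centered_cross(x1, x2)
    s1y, s2y, syy = centered_cross(x1, y), centered_cross(x2, y), centered_cross(y, y)
    det = s11 * s22 - s12 ** 2          # determinant of the 2x2 normal equations
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    n = len(y)
    b0 = sum(y) / n - b1 * sum(x1) / n - b2 * sum(x2) / n
    r_squared = (b1 * s1y + b2 * s2y) / syy  # explained / total variation
    return b0, b1, b2, r_squared


b0, b1, b2, r_squared = fit_two_predictors(distance, level_code, days_absent)
print(f"Days absent = {b0:.4f} + ({b1:.4f}) * distance + ({b2:.4f}) * level")
print(f"R^2 = {r_squared:.4f}")
```

Treating the career level as a single 1-2-3 score assumes equal spacing between levels; separate 0/1 indicator variables would be an alternative coding and would give a different fit.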
Question 5
Please refer to the Excel file ‘Question 5.xlsx’ for regression summary outputs.
- The following regression equation is fitted to predict the weekly gross revenue with the ‘amount of television advertising’ as the predictor (or independent) variable:
Weekly Gross Revenue = -18.36 + 33.36 (Television Advertising)
The overall model is not statistically significant at a significance level of 0.05. Moreover, the model is a poor fit as it explains only about 43.10% of the variation in the dependent variable ‘weekly gross revenue’.
- The following regression equation is fitted to predict the weekly gross revenue with both television advertising and newspaper advertising as the predictor (or independent) variables:
Weekly Gross Revenue = -41.41 + 19.00 (Television Advertising) + 21.94 (Newspaper Advertising)
The overall model is statistically significant at a significance level of 0.05. Moreover, the model is a ‘very good’ fit as it explains about 90.48% of the variation in the dependent variable ‘weekly gross revenue’.
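The significance statements in Question 5 can be cross-checked from the reported R² values alone, using the overall F statistic F = (R²/k) / ((1 - R²)/(n - k - 1)) with n = 8 markets. This is a sketch: the function name `f_statistic` is hypothetical, and the 5% critical values are hard-coded from a standard F table rather than computed.

```python
def f_statistic(r_squared, n, k):
    """Overall F statistic for a regression with k predictors and n observations."""
    return (r_squared / k) / ((1 - r_squared) / (n - k - 1))


N_MARKETS = 8

# Upper 5% critical values of F, keyed by (numerator df, denominator df).
# Assumption: values copied from a standard F table.
F_CRIT = {(1, 6): 5.99, (2, 5): 5.79}

# Reported R^2 and number of predictors for each model in Question 5.
models = {"TV only": (0.4310, 1), "TV + newspaper": (0.9048, 2)}

for name, (r2, k) in models.items():
    f = f_statistic(r2, N_MARKETS, k)
    crit = F_CRIT[(k, N_MARKETS - k - 1)]
    verdict = "significant" if f > crit else "not significant"
    print(f"{name}: F = {f:.2f}, critical value = {crit}, {verdict} at 0.05")
```

The television-only model falls below its critical value while the two-predictor model is well above it, which agrees with the conclusions stated for parts (a) and (b).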
Question 6
Please refer to the Excel file ‘Question 6.xlsx’ for outputs.
Question 7
Please refer to the Excel file ‘Question 7.xlsx’.
The break-even point was identified for about 75 non-member registrations. Please refer to the Excel file for workings.
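The spreadsheet logic for Question 7 can be sketched as a small profit function. The sketch assumes that the per-person costs (registration processing, breakfast, lunch, parking) apply to every attendee, including the 50 free member seats, and that all member seats are used; the names `profit` and `break_even` are hypothetical.

```python
AUDITORIUM = 150.00                       # rental cost
SPEAKERS = 3 * 800.00                     # three speakers at $800 each
PER_PERSON = 8.50 + 4.00 + 7.00 + 5.00    # registration, breakfast, lunch, parking
MEMBER_SEATS = 10 * 5                     # 10 corporate members, 5 free seats each
FEE = 75.00                               # nonmember registration fee


def profit(nonmembers):
    """Profit (negative = loss) for a given number of nonmember registrants."""
    attendees = MEMBER_SEATS + nonmembers
    revenue = FEE * nonmembers
    cost = AUDITORIUM + SPEAKERS + PER_PERSON * attendees
    return revenue - cost


def break_even():
    """Nonmember count at which profit is zero (what Goal Seek solves for)."""
    fixed = AUDITORIUM + SPEAKERS + PER_PERSON * MEMBER_SEATS
    return fixed / (FEE - PER_PERSON)


print(f"Break-even nonmembers: {break_even():.2f}")
for n in (74, 75):
    print(f"Profit with {n} nonmembers: {profit(n):.2f}")
```

Because registrants come in whole numbers, the first count with a non-negative profit is the practical break-even figure.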
Age | Female | Income | Married | Children | CarLoan | Mortgage |
48 | 1 | 17546.00 | 0 | 1 | 0 | 0 |
40 | 0 | 30085.10 | 1 | 3 | 1 | 1 |
51 | 1 | 16575.40 | 1 | 0 | 1 | 0 |
23 | 1 | 20375.40 | 1 | 3 | 0 | 0 |
57 | 1 | 50576.30 | 1 | 0 | 0 | 0 |
57 | 1 | 37869.60 | 1 | 2 | 0 | 0 |
22 | 0 | 8877.07 | 0 | 0 | 0 | 0 |
58 | 0 | 24946.60 | 1 | 0 | 1 | 0 |
37 | 1 | 25304.30 | 1 | 2 | 1 | 0 |
54 | 0 | 24212.10 | 1 | 2 | 1 | 0 |
66 | 1 | 59803.90 | 1 | 0 | 0 | 0 |
52 | 1 | 26658.80 | 0 | 0 | 1 | 1 |
44 | 1 | 15735.80 | 1 | 1 | 0 | 1 |
66 | 1 | 55204.70 | 1 | 1 | 1 | 1 |
36 | 0 | 19474.60 | 1 | 0 | 0 | 1 |
38 | 1 | 22342.10 | 1 | 0 | 1 | 1 |
37 | 1 | 17729.80 | 1 | 2 | 0 | 1 |
46 | 1 | 41016.00 | 1 | 0 | 0 | 1 |
62 | 1 | 26909.20 | 1 | 0 | 0 | 0 |
31 | 0 | 22522.80 | 1 | 0 | 1 | 0 |
61 | 0 | 57880.70 | 1 | 2 | 0 | 0 |
50 | 0 | 16497.30 | 1 | 2 | 0 | 0 |
54 | 0 | 38446.60 | 1 | 0 | 0 | 0 |
27 | 1 | 15538.80 | 0 | 0 | 1 | 1 |
22 | 0 | 12640.30 | 0 | 2 | 1 | 0 |
56 | 0 | 41034.00 | 1 | 0 | 1 | 1 |
45 | 0 | 20809.70 | 1 | 0 | 0 | 1 |
39 | 1 | 20114.00 | 1 | 1 | 0 | 0 |
39 | 1 | 29359.10 | 0 | 3 | 1 | 1 |
61 | 0 | 24270.10 | 1 | 1 | 0 | 0 |
Workings | |
Class Intervals (marks) | Bins |
800-1000 | 1000.00 |
1000-1200 | 1200.00 |
1200-1400 | 1400.00 |
1400-1600 | 1600.00 |
1600-1800 | 1800.00 |
1800-2000 | 2000.00 |
2000-2200 | 2200.00 |
2200-2400 | 2400.00 |