Task 1
Download “data.csv” and “header-description.txt” from CloudDeakin. Produce an Orange loadable file based on your experience in processing the data in the last workshop.
Task 2
Using your knowledge of classification, regression and data exploration, answer the following questions:
• Produce a model to predict the likely normalized-losses of a given vehicle’s specification, risk factor and normalized-losses.
• Produce a model that can classify a vehicle according to its risk factor. That is, given a vehicle’s specification, normalized losses and price, the classifier should produce a class label showing the likely risk factor of that vehicle.
• What are the top five attributes that best determine the risk factor? In listing the top five attributes, discuss how you determined them and supplement with appropriate Orange files if available. Your discussion should be no more than 500 words.
• Can we predict the normalized losses of “alfa-romero” and “isuzu”? If no, explain why. If yes, show how you would do so. Discuss in no more than 500 words.
Solution:
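For Task 1, a minimal sketch of writing an Orange tab-delimited file with Python's standard library could look as follows. It assumes Orange's three-row header layout (attribute names, then types, then flags). The column subset, the types and flags assigned to the columns, the sample rows and the helper name `write_orange_tab` are illustrative assumptions, not the actual layout of data.csv.

```python
import csv
import io

# Hypothetical subset of columns; the real names come from header-description.txt.
COLUMNS = [
    # (name, Orange type, Orange flag)
    ("symboling", "discrete", "class"),
    ("normalized-losses", "continuous", ""),
    ("make", "discrete", ""),
    ("price", "continuous", ""),
]


def write_orange_tab(rows, out):
    """Write rows (lists of strings) as a tab-delimited file with a 3-row header."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow([name for name, _, _ in COLUMNS])  # row 1: attribute names
    writer.writerow([kind for _, kind, _ in COLUMNS])  # row 2: attribute types
    writer.writerow([flag for _, _, flag in COLUMNS])  # row 3: flags (class, meta, ...)
    for row in rows:
        # "?" is written unchanged; Orange is expected to read it as a missing value.
        writer.writerow(row)


# Made-up rows for illustration only.
sample_rows = [
    ["1", "?", "make-a", "10000"],
    ["0", "120", "make-b", "15000"],
]
buffer = io.StringIO()
write_orange_tab(sample_rows, buffer)
tab_text = buffer.getvalue()
```

Writing to a file opened with `open("data.tab", "w", newline="")` instead of the `StringIO` buffer would give a file that can be opened with Orange's File widget, provided the types chosen for each column match the data.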
The first question calls for a model that predicts the likely normalized losses of a vehicle. For this purpose we fit an ordinal regression model with a logit link. The dependent variable is normalized losses and the independent variables are make of the car, fuel type, aspiration, number of doors, body style, drive wheels and engine location. The output for this regression model is given below.
The model fitting information is given in the following table:
Model Fitting Information
Model | -2 Log Likelihood | Chi-Square | df | Sig.
Intercept Only | 1339.404 | | |
Final | 1224.569 | 114.834 | 32 | .000
Link function: Logit.
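The final model has a likelihood-ratio chi-square of 114.834 on 32 degrees of freedom, and its p-value is reported as .000, that is, less than 0.001. Since this is below the significance level α = 0.05, we reject the null hypothesis that the intercept-only model fits as well as the final model. In other words, the independent variables taken together significantly improve the fit.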
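The chi-square statistic in the table is the difference between the two -2 log likelihood values, and its p-value is the upper tail of a chi-square distribution with 32 degrees of freedom. A minimal sketch of that arithmetic, using only the standard library, is given here. The helper `chi2_sf_even_df` is a hypothetical name; it relies on the closed form of the chi-square tail probability that holds only for an even number of degrees of freedom.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable; closed form for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only holds for positive even df")
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i  # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half) * total


neg2ll_intercept = 1339.404  # from the Model Fitting Information table
neg2ll_final = 1224.569      # from the Model Fitting Information table
df = 32

lr_chi_square = neg2ll_intercept - neg2ll_final
p_value = chi2_sf_even_df(lr_chi_square, df)
print(f"chi-square = {lr_chi_square:.3f}, p = {p_value:.2e}")
```

The difference of the printed values is 114.835 rather than the reported 114.834, which is consistent with rounding in the table. The resulting p-value is far below 0.001, in line with the reported .000.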
Now, let us see the test for goodness of fit.
Goodness-of-Fit
Statistic | Chi-Square | df | Sig.
Pearson | 7483.630 | 5527 | .000
Deviance | 1167.756 | 5527 | 1.000
Link function: Logit.
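The Pearson chi-square statistic is 7483.630 on 5527 degrees of freedom with a p-value below 0.001, while the deviance statistic is 1167.756 on the same degrees of freedom with a p-value of 1.000. The null hypothesis of both tests is that the model fits the data adequately, so at α = 0.05 the Pearson test rejects an adequate fit and the deviance test does not. The two statistics disagree, which can happen when many combinations of predictor values and response levels contain few or no observations, so these goodness-of-fit results should be read with caution.
The pseudo R-square values are given in the following table: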
Pseudo R-Square
Measure | Value
Cox and Snell | .429
Nagelkerke | .429
McFadden | .080
Link function: Logit.
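The Cox and Snell and Nagelkerke pseudo R-square values are both 0.429, while the McFadden value is 0.080. Pseudo R-square measures are not a proportion of explained variance in the way R-square is in linear regression; they compare the fitted model with the intercept-only model. The Cox and Snell and Nagelkerke values suggest that the independent variables (make of the car, fuel type, aspiration, number of doors, body style, drive wheels and engine location) give a moderate improvement over the intercept-only model, whereas the McFadden value points to only a small one.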
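The Cox and Snell and Nagelkerke values can be approximately recovered from the -2 log likelihood values in the Model Fitting Information table. The sketch below assumes a sample size of n = 205 (the total shown in the frequency tables; the source does not state how many cases entered the fit) and that the reported -2 log likelihoods can be plugged into the usual formulas directly. It covers only these two measures.

```python
import math

n = 205                      # assumption: all 205 cars were used in the fit
neg2ll_intercept = 1339.404  # from the Model Fitting Information table
neg2ll_final = 1224.569      # from the Model Fitting Information table

# Likelihood-ratio statistic: improvement of the final model over intercept-only.
lr = neg2ll_intercept - neg2ll_final

# Cox and Snell: 1 - (L0 / L1)^(2/n), written in terms of -2 log likelihoods.
cox_snell = 1.0 - math.exp(-lr / n)

# Nagelkerke rescales Cox and Snell by its maximum attainable value.
max_cox_snell = 1.0 - math.exp(-neg2ll_intercept / n)
nagelkerke = cox_snell / max_cox_snell

print(f"Cox and Snell = {cox_snell:.3f}, Nagelkerke = {nagelkerke:.3f}")
```

Under these assumptions both values come out close to the reported .429, and the two are nearly equal because the maximum attainable Cox and Snell value is very close to 1 here.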
Next, we look at the frequency distributions of some of the variables, starting with symboling, the risk factor. The frequency distribution for this variable is given in the following table:
Symboling
symboling | Frequency | Percent | Valid Percent | Cumulative Percent
-2.00 | 3 | 1.5 | 1.5 | 1.5
-1.00 | 22 | 10.7 | 10.7 | 12.2
.00 | 67 | 32.7 | 32.7 | 44.9
1.00 | 54 | 26.3 | 26.3 | 71.2
2.00 | 32 | 15.6 | 15.6 | 86.8
3.00 | 27 | 13.2 | 13.2 | 100.0
Total | 205 | 100.0 | 100.0 |
On the symboling scale, a value of +3 indicates that the vehicle is risky and a value of -3 indicates that it is pretty safe; the lowest value observed in this data set is -2.
The value 0 indicates that the car is neither risky nor safe. This neutral value has the highest frequency, 67 of the 205 cars (32.7%).
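The Percent and Cumulative Percent columns of the symboling table follow directly from the frequencies. A short sketch of that calculation, using the counts from the table, is given here; the variable names are illustrative.

```python
# Counts copied from the Symboling frequency table.
counts = {-2: 3, -1: 22, 0: 67, 1: 54, 2: 32, 3: 27}

total = sum(counts.values())
running = 0
rows = []
for value, freq in counts.items():
    running += freq
    percent = round(100.0 * freq / total, 1)
    cumulative = round(100.0 * running / total, 1)
    rows.append((value, freq, percent, cumulative))

for value, freq, percent, cumulative in rows:
    print(f"{value:>3} {freq:>4} {percent:>6.1f} {cumulative:>6.1f}")

# The symboling value with the highest frequency.
most_common_value = max(counts, key=counts.get)
```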
The frequency distribution for the make of the car is given below:
make
make | Frequency | Percent | Valid Percent | Cumulative Percent
alfa-romero | 3 | 1.5 | 1.5 | 1.5
audi | 7 | 3.4 | 3.4 | 4.9
bmw | 8 | 3.9 | 3.9 | 8.8
chevrolet | 3 | 1.5 | 1.5 | 10.2
dodge | 9 | 4.4 | 4.4 | 14.6
honda | 13 | 6.3 | 6.3 | 21.0
isuzu | 4 | 2.0 | 2.0 | 22.9
jaguar | 3 | 1.5 | 1.5 | 24.4
mazda | 17 | 8.3 | 8.3 | 32.7
mercedes-benz | 8 | 3.9 | 3.9 | 36.6
mercury | 1 | .5 | .5 | 37.1
mitsubishi | 13 | 6.3 | 6.3 | 43.4
nissan | 18 | 8.8 | 8.8 | 52.2
peugot | 11 | 5.4 | 5.4 | 57.6
plymouth | 7 | 3.4 | 3.4 | 61.0
porsche | 5 | 2.4 | 2.4 | 63.4
renault | 2 | 1.0 | 1.0 | 64.4
saab | 6 | 2.9 | 2.9 | 67.3
subaru | 12 | 5.9 | 5.9 | 73.2
toyota | 32 | 15.6 | 15.6 | 88.8
volkswagen | 12 | 5.9 | 5.9 | 94.6
volvo | 11 | 5.4 | 5.4 | 100.0
Total | 205 | 100.0 | 100.0 |
Toyota is the most frequent make, with 32 of the 205 cars (15.6%).
The frequency distribution for the fuel type of the car is given in the following table:
fueltype
fueltype | Frequency | Percent | Valid Percent | Cumulative Percent
diesel | 20 | 9.8 | 9.8 | 9.8
gas | 185 | 90.2 | 90.2 | 100.0
Total | 205 | 100.0 | 100.0 |
The frequency distribution for the variable aspiration is given below:
aspiration
aspiration | Frequency | Percent | Valid Percent | Cumulative Percent
std | 168 | 82.0 | 82.0 | 82.0
turbo | 37 | 18.0 | 18.0 | 100.0
Total | 205 | 100.0 | 100.0 |
The frequency distribution for the number of doors for the car is given in the following table:
num_of_doors
num_of_doors | Frequency | Percent | Valid Percent | Cumulative Percent
? | 2 | 1.0 | 1.0 | 1.0
four | 114 | 55.6 | 55.6 | 56.6
two | 89 | 43.4 | 43.4 | 100.0
Total | 205 | 100.0 | 100.0 |
The frequency distribution for the body style of the car is given in the following table:
body_style
body_style | Frequency | Percent | Valid Percent | Cumulative Percent
convertible | 6 | 2.9 | 2.9 | 2.9
hardtop | 8 | 3.9 | 3.9 | 6.8
hatchback | 70 | 34.1 | 34.1 | 41.0
sedan | 96 | 46.8 | 46.8 | 87.8
wagon | 25 | 12.2 | 12.2 | 100.0
Total | 205 | 100.0 | 100.0 |
The frequency distribution for the variable drive wheels is given in the following table:
drive_wheels
drive_wheels | Frequency | Percent | Valid Percent | Cumulative Percent
4wd | 9 | 4.4 | 4.4 | 4.4
fwd | 120 | 58.5 | 58.5 | 62.9
rwd | 76 | 37.1 | 37.1 | 100.0
Total | 205 | 100.0 | 100.0 |
The frequency distribution for the engine location is given in the following table:
engine_location
engine_location | Frequency | Percent | Valid Percent | Cumulative Percent
front | 202 | 98.5 | 98.5 | 98.5
rear | 3 | 1.5 | 1.5 | 100.0
Total | 205 | 100.0 | 100.0 |
The engine is located at the front for 202 cars and at the rear for only 3 cars.
Next, we look at some cross tabulations.
The cross tabulation for the variables symboling and the type of fuel is given in the following table:
symboling * fueltype Crosstabulation (Count)
symboling | diesel | gas | Total
-2.00 | 0 | 3 | 3
-1.00 | 5 | 17 | 22
.00 | 11 | 56 | 67
1.00 | 1 | 53 | 54
2.00 | 3 | 29 | 32
3.00 | 0 | 27 | 27
Total | 20 | 185 | 205
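A cross tabulation of this kind counts how often each pair of values occurs together. The sketch below builds one from a list of (symboling, fueltype) pairs with the standard library; the records are made up for illustration and are not rows of the data set.

```python
from collections import Counter

# Made-up (symboling, fueltype) pairs; the real pairs come from the data set.
records = [(0, "gas"), (0, "diesel"), (1, "gas"), (3, "gas"), (0, "gas"), (-1, "diesel")]

cell_counts = Counter(records)
row_levels = sorted({symboling for symboling, _ in records})
col_levels = sorted({fuel for _, fuel in records})

# table[row][col] is the number of records with that combination (0 if none).
table = {
    row: {col: cell_counts[(row, col)] for col in col_levels}
    for row in row_levels
}
row_totals = {row: sum(table[row].values()) for row in row_levels}
col_totals = {col: sum(table[row][col] for row in row_levels) for col in col_levels}

for row in row_levels:
    cells = " | ".join(str(table[row][col]) for col in col_levels)
    print(f"{row} | {cells} | {row_totals[row]}")
```

The row and column totals should each add up to the number of records, which is a quick check that no pair was dropped.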
The cross tabulation for the variables symboling and the number of doors for the cars is given in the following table:
symboling * num_of_doors Crosstabulation (Count)
symboling | ? | four | two | Total
-2.00 | 0 | 3 | 0 | 3
-1.00 | 0 | 22 | 0 | 22
.00 | 1 | 59 | 7 | 67
1.00 | 1 | 20 | 33 | 54
2.00 | 0 | 10 | 22 | 32
3.00 | 0 | 0 | 27 | 27
Total | 2 | 114 | 89 | 205
The cross tabulation for the variables symboling and the body style of the car is given in the following table:
symboling * body_style Crosstabulation (Count)
symboling | convertible | hardtop | hatchback | sedan | wagon | Total
-2.00 | 0 | 0 | 0 | 3 | 0 | 3
-1.00 | 0 | 0 | 2 | 13 | 7 | 22
.00 | 0 | 1 | 8 | 43 | 15 | 67
1.00 | 0 | 1 | 27 | 23 | 3 | 54
2.00 | 1 | 4 | 13 | 14 | 0 | 32
3.00 | 5 | 2 | 20 | 0 | 0 | 27
Total | 6 | 8 | 70 | 96 | 25 | 205
The cross tabulation for the variables type of fuel and the body style of the car is given in the following table:
fueltype * body_style Crosstabulation (Count)
fueltype | convertible | hardtop | hatchback | sedan | wagon | Total
diesel | 0 | 1 | 1 | 15 | 3 | 20
gas | 6 | 7 | 69 | 81 | 22 | 185
Total | 6 | 8 | 70 | 96 | 25 | 205
Task 3
Discuss how you would ensure that the models you produced in Task 2 are reliable and accurate.
In Task 2 we constructed a regression model to predict the likely normalized losses of a vehicle. We used an ordinal regression model with normalized losses as the dependent variable and make of the car, fuel type, aspiration, number of doors, body style, drive wheels and engine location as the independent variables. We chose this model because the values of normalized losses follow a specific order over a specific range, so the dependent variable can be treated as ordinal.
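One common way to check how accurate a predictive model is on unseen data is k-fold cross-validation, which the discussion above does not carry out. The sketch below is only an illustration under stated assumptions: the target values are made up, the helper names are hypothetical, and a predict-the-training-mean baseline stands in for the ordinal regression model, which would take its place in a real check.

```python
import random
from statistics import mean


def k_fold_indices(n_items, k, seed=0):
    """Split range(n_items) into k shuffled, near-equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_mae(targets, k=5):
    """Mean absolute error of a predict-the-training-mean baseline over k folds."""
    folds = k_fold_indices(len(targets), k)
    errors = []
    for held_out in folds:
        held = set(held_out)
        # Fit on everything outside the held-out fold.
        train = [t for i, t in enumerate(targets) if i not in held]
        baseline = mean(train)
        # Score only on the held-out fold.
        errors.extend(abs(targets[i] - baseline) for i in held_out)
    return mean(errors)


# Made-up target values standing in for normalized losses.
example_targets = [100, 120, 95, 150, 130, 110, 160, 90, 105, 140]
print(cross_validated_mae(example_targets, k=5))
```

A model that is worth keeping should give a clearly lower held-out error than this baseline; a similar error would suggest the predictors add little.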
Task 4
As noted in the introduction, there are missing values in the data set. Discuss what you would do with these missing values. Do you remove them, attempt to provide values to these unknowns, or attempt a combination of different techniques? Discuss in no more than 500 words.
We observed missing values for some variables in the data set. Missing values become a problem when we analyse the data for further estimation or planning. When we have no information about the missing values, or do not know the pattern behind them, we do not include them in the analysis. We remove the records with missing values before analysis or classification, because including them would not give accurate estimates for our results.
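The removal described above can be sketched as follows, with mean imputation shown next to it only for contrast, since the task also mentions providing values for the unknowns. The "?" marker is the one that appears in the num_of_doors frequency table; the records and helper names are made up for illustration.

```python
from statistics import mean

MISSING = "?"  # the marker that appears in the num_of_doors frequency table

# Made-up records: (num_of_doors, normalized_losses)
records = [("four", "120"), ("?", "95"), ("two", "?"), ("two", "150")]


def drop_incomplete(rows):
    """Listwise deletion: keep only rows with no missing marker."""
    return [row for row in rows if MISSING not in row]


def impute_numeric_mean(rows, column):
    """Replace missing entries of a numeric column with the mean of observed values."""
    observed = [float(row[column]) for row in rows if row[column] != MISSING]
    fill = mean(observed)
    filled = []
    for row in rows:
        row = list(row)
        if row[column] == MISSING:
            row[column] = str(fill)
        filled.append(tuple(row))
    return filled


complete_rows = drop_incomplete(records)
imputed_rows = impute_numeric_mean(records, column=1)
print(len(records), len(complete_rows), len(imputed_rows))
```

Deletion keeps only the fully observed records, so it shrinks the data set, while imputation keeps every record at the cost of filling in values that were never observed.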