Task 1
Download “data.csv” and “headerdescription.txt” from CloudDeakin. Produce an Orange loadable file based on your experience in processing the data in the last workshop.
Task 2
Using your knowledge of classification, regression and data exploration, answer the following questions:
• Produce a model to predict the likely normalized-losses of a given vehicle’s specification, risk factor and normalized-losses.
• Produce a model that can classify a vehicle according to its risk factor. That is, given a vehicle’s specification, normalized losses and price, the classifier should produce a class label showing the likely risk factor of that vehicle.
• What are the top five attributes that best determine the risk factor? In listing the top five attributes, discuss how you determine them and supplement with appropriate Orange files if available. Your discussion should be no more than 500 words.
• Can we predict the normalized losses of “alfa-romero” and “isuzu”? If no, explain why. If yes, show how you would do so. Discuss in no more than 500 words.
Solution:
Here we construct a regression model to predict the likely normalized losses of a vehicle from its specification. We use an ordinal regression model with a logit link. The dependent variable is normalized losses, and the independent variables are the make of the car, fuel type, aspiration, number of doors, body style, drive wheels and engine location. The output for this model is given below.
The model fitting information is given in the following table:
Model Fitting Information

Model            -2 Log Likelihood   Chi-Square   df   Sig.
Intercept Only   1339.404
Final            1224.569            114.834      32   .000

Link function: Logit.
The p-value for the final model is reported as .000, which is less than the significance level alpha = 0.05. We therefore reject the null hypothesis that the intercept-only model fits the data as well as the final model; that is, the independent variables significantly improve the fit of the ordinal regression model.
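The chi-square statistic in the model fitting table is the difference between the two -2 log-likelihood values (1339.404 and 1224.569). The following is a minimal sketch of that arithmetic and of the corresponding p-value. The helper name chi2_sf_even_df is our own, and the closed-form tail probability it uses is valid only for an even number of degrees of freedom, which holds here (df = 32). The difference comes out as 114.835 rather than the tabulated 114.834 because the table values are rounded.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable.

    Uses the closed form exp(-x/2) * sum_{i<df/2} (x/2)^i / i!,
    which holds only when df is a positive even integer.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i   # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half) * total


# Values taken from the Model Fitting Information table.
neg2ll_intercept_only = 1339.404
neg2ll_final = 1224.569
df = 32

lr_chi_square = neg2ll_intercept_only - neg2ll_final
p_value = chi2_sf_even_df(lr_chi_square, df)

print(f"Chi-square = {lr_chi_square:.3f}, df = {df}, p = {p_value:.3g}")
```

The p-value is far below 0.05, which is why the table shows it as .000 after rounding to three decimals.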
Now, let us see the test for goodness of fit.
Goodness-of-Fit

            Chi-Square   df     Sig.
Pearson     7483.630     5527   .000
Deviance    1167.756     5527   1.000

Link function: Logit.
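The two goodness-of-fit statistics disagree. The p-value for the Pearson chi-square test is .000, which is less than the significance level 0.05, so on this statistic we reject the null hypothesis that the model fits the observed data adequately. The p-value for the deviance test is 1.000, so on this statistic we do not reject it. Both tests have 5527 degrees of freedom, far more than the 205 observations, which suggests that most cells are empty or sparse; the chi-square approximation is unreliable in that situation, so neither result should be given much weight.
The pseudo R-square values are given in the following table: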
Pseudo R-Square

Cox and Snell   .429
Nagelkerke      .429
McFadden        .080

Link function: Logit.
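The Cox and Snell and Nagelkerke values are both 0.429, while the McFadden value is only 0.080. A pseudo R-square is only a rough analogue of the coefficient of determination, so 0.429 should not be read as exactly 42.9% of the variation explained. Taken together, the values indicate that the independent variables (make of the car, fuel type, aspiration, number of doors, body style, drive wheels and engine location) explain a modest to moderate share of the variation in normalized losses.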
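As a check on the reported figure, the Cox and Snell value can be recovered from the model chi-square with the standard formula 1 - exp(-chi-square / n). The sketch below assumes that all 205 records entered the model; that sample size is our assumption, since the output does not state it. The McFadden value is not recomputed here because it needs the full log-likelihoods, which the table does not necessarily report.

```python
import math


def cox_snell_r2(lr_chi_square, n_cases):
    """Cox and Snell pseudo R-square from the likelihood-ratio chi-square."""
    return 1.0 - math.exp(-lr_chi_square / n_cases)


lr_chi_square = 114.834   # Chi-Square from the Model Fitting Information table
n_cases = 205             # assumption: every record in the data set was used

cox_snell = cox_snell_r2(lr_chi_square, n_cases)
print(f"Cox and Snell R-square = {cox_snell:.3f}")
```

Under that assumption the result rounds to 0.429, in agreement with the table.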
Next, we look at the frequency distributions of some of the variables, beginning with symboling, the risk factor. The frequency distribution for this variable is given in the following table:
Symboling

Valid           Frequency   Percent   Valid Percent   Cumulative Percent
-2.00           3           1.5       1.5             1.5
-1.00           22          10.7      10.7            12.2
.00             67          32.7      32.7            44.9
1.00            54          26.3      26.3            71.2
2.00            32          15.6      15.6            86.8
3.00            27          13.2      13.2            100.0
Total           205         100.0     100.0
For this variable, a value of +3 indicates that the vehicle is risky and a value of -3 indicates that the car is pretty safe.
A value of 0.00 indicates that the car is neither risky nor safe. This neutral value has the highest frequency of any value, 67 (32.7% of the 205 cars).
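The percent and cumulative percent columns follow directly from the frequencies. As a minimal sketch, the function below (frequency_table is our own name, not part of any package) rebuilds those columns from the symboling counts, taking the two lowest values to be -2 and -1.

```python
def frequency_table(counts):
    """Return (value, frequency, percent, cumulative percent) rows and the total."""
    total = sum(counts.values())
    rows = []
    running = 0
    for value in sorted(counts):
        running += counts[value]
        percent = round(100.0 * counts[value] / total, 1)
        cumulative = round(100.0 * running / total, 1)
        rows.append((value, counts[value], percent, cumulative))
    return rows, total


# Frequencies of symboling as reported in the table.
symboling_counts = {-2: 3, -1: 22, 0: 67, 1: 54, 2: 32, 3: 27}

rows, total = frequency_table(symboling_counts)
for value, freq, percent, cumulative in rows:
    print(f"{value:>3}  {freq:>4}  {percent:>5}  {cumulative:>5}")
print(f"Total {total}")

# The most frequent value is the neutral one.
most_common_value = max(symboling_counts, key=symboling_counts.get)
```

The same function can be applied to the counts of any of the other categorical variables.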
The frequency distribution for the make of the car is given below:
make

Valid           Frequency   Percent   Valid Percent   Cumulative Percent
alfa-romero     3           1.5       1.5             1.5
audi            7           3.4       3.4             4.9
bmw             8           3.9       3.9             8.8
chevrolet       3           1.5       1.5             10.2
dodge           9           4.4       4.4             14.6
honda           13          6.3       6.3             21.0
isuzu           4           2.0       2.0             22.9
jaguar          3           1.5       1.5             24.4
mazda           17          8.3       8.3             32.7
mercedes-benz   8           3.9       3.9             36.6
mercury         1           .5        .5              37.1
mitsubishi      13          6.3       6.3             43.4
nissan          18          8.8       8.8             52.2
peugot          11          5.4       5.4             57.6
plymouth        7           3.4       3.4             61.0
porsche         5           2.4       2.4             63.4
renault         2           1.0       1.0             64.4
saab            6           2.9       2.9             67.3
subaru          12          5.9       5.9             73.2
toyota          32          15.6      15.6            88.8
volkswagen      12          5.9       5.9             94.6
volvo           11          5.4       5.4             100.0
Total           205         100.0     100.0
Toyota accounts for the largest share of the cars, 15.6% (32 of 205).
The frequency distribution for the fuel type of the car is given in the following table:
fueltype

Valid           Frequency   Percent   Valid Percent   Cumulative Percent
diesel          20          9.8       9.8             9.8
gas             185         90.2      90.2            100.0
Total           205         100.0     100.0
The frequency distribution for the variable aspiration is given below:
aspiration

Valid           Frequency   Percent   Valid Percent   Cumulative Percent
std             168         82.0      82.0            82.0
turbo           37          18.0      18.0            100.0
Total           205         100.0     100.0
The frequency distribution for the number of doors for the car is given in the following table:
num_of_doors

Valid           Frequency   Percent   Valid Percent   Cumulative Percent
?               2           1.0       1.0             1.0
four            114         55.6      55.6            56.6
two             89          43.4      43.4            100.0
Total           205         100.0     100.0
The frequency distribution for the body style of the car is given in the following table:
body_style

Valid           Frequency   Percent   Valid Percent   Cumulative Percent
convertible     6           2.9       2.9             2.9
hardtop         8           3.9       3.9             6.8
hatchback       70          34.1      34.1            41.0
sedan           96          46.8      46.8            87.8
wagon           25          12.2      12.2            100.0
Total           205         100.0     100.0
The frequency distribution for the variable drive wheels is given in the following table:
drive_wheels

Valid           Frequency   Percent   Valid Percent   Cumulative Percent
4wd             9           4.4       4.4             4.4
fwd             120         58.5      58.5            62.9
rwd             76          37.1      37.1            100.0
Total           205         100.0     100.0
The classification for the engine location is given in the following table:
engine_location

Valid           Frequency   Percent   Valid Percent   Cumulative Percent
front           202         98.5      98.5            98.5
rear            3           1.5       1.5             100.0
Total           205         100.0     100.0
The engine is located at the front for 202 cars and at the rear for only 3 cars.
Next, we look at some cross-tabulations.
The cross tabulation for the variables symboling and the type of fuel is given in the following table:
symboling * fueltype Crosstabulation

Count; rows: symboling, columns: fueltype

symboling   diesel   gas   Total
-2.00       0        3     3
-1.00       5        17    22
.00         11       56    67
1.00        1        53    54
2.00        3        29    32
3.00        0        27    27
Total       20       185   205
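A cross-tabulation of this kind is a count of how often each pair of values occurs together. The sketch below shows one way to build it in plain Python; the crosstab function and the five sample records are hypothetical illustrations, not rows taken from data.csv.

```python
from collections import Counter


def crosstab(records, row_key, col_key):
    """Count co-occurrences of two fields; returns (table, row values, column values)."""
    cells = Counter((rec[row_key], rec[col_key]) for rec in records)
    row_values = sorted({r for r, _ in cells})
    col_values = sorted({c for _, c in cells})
    # Counter returns 0 for pairs that never occur, so empty cells are filled in.
    table = {
        rv: {cv: cells[(rv, cv)] for cv in col_values}
        for rv in row_values
    }
    return table, row_values, col_values


# Hypothetical records for illustration only.
sample = [
    {"symboling": 3, "fueltype": "gas"},
    {"symboling": 3, "fueltype": "gas"},
    {"symboling": 0, "fueltype": "diesel"},
    {"symboling": 0, "fueltype": "gas"},
    {"symboling": -1, "fueltype": "diesel"},
]

table, row_values, col_values = crosstab(sample, "symboling", "fueltype")
for rv in row_values:
    counts = [table[rv][cv] for cv in col_values]
    print(rv, counts, sum(counts))
```

Applied to the full data set with the same two field names, this would give the counts in the symboling by fueltype table.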
The cross tabulation for the variables symboling and the number of doors for the cars is given in the following table:
symboling * num_of_doors Crosstabulation

Count; rows: symboling, columns: num_of_doors

symboling   ?   four   two   Total
-2.00       0   3      0     3
-1.00       0   22     0     22
.00         1   59     7     67
1.00        1   20     33    54
2.00        0   10     22    32
3.00        0   0      27    27
Total       2   114    89    205
The cross tabulation for the variables symboling and the body style of the car is given in the following table:
symboling * body_style Crosstabulation

Count; rows: symboling, columns: body_style

symboling   convertible   hardtop   hatchback   sedan   wagon   Total
-2.00       0             0         0           3       0       3
-1.00       0             0         2           13      7       22
.00         0             1         8           43      15      67
1.00        0             1         27          23      3       54
2.00        1             4         13          14      0       32
3.00        5             2         20          0       0       27
Total       6             8         70          96      25      205
The cross tabulation for the variables type of fuel and the body style of the car is given in the following table:
fueltype * body_style Crosstabulation

Count; rows: fueltype, columns: body_style

fueltype    convertible   hardtop   hatchback   sedan   wagon   Total
diesel      0             1         1           15      3       20
gas         6             7         69          81      22      185
Total       6             8         70          96      25      205
Task 3
Discuss how you would ensure that the models you produced in Task 2 are reliable and accurate.
In Task 2 we constructed an ordinal regression model to predict the likely normalized losses of a vehicle. The dependent variable is normalized losses, and the independent variables are the make of the car, fuel type, aspiration, number of doors, body style, drive wheels and engine location. We chose ordinal regression because the values of normalized losses fall in a definite order over a specific range, so the dependent variable can be treated as ordered. To judge how reliable the model is, we rely on the model fitting information, the goodness-of-fit tests and the pseudo R-square values reported in Task 2.
Task 4
As noted in the introduction, there are missing values in the data set. Discuss what you would do with these missing values. Do you remove them, attempt to provide values to these unknowns, or attempt a combination of different techniques? Discuss in no more than 500 words.
We observed missing values for some variables in the data set; for example, the number of doors is recorded as “?” for 2 of the 205 cars. Missing values become a problem when we analyse the data for further estimation or planning. When we know nothing about the missing values, or about the pattern in which they occur, we do not include them in the analysis. We remove these missing values at the time of analysis or classification, because including them would not give accurate estimates for our results.
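One way to read the removal described above is listwise deletion: any record with a missing value in a variable used by the analysis is left out. The sketch below assumes that missing values are marked with "?", as in the num_of_doors table. The function name drop_incomplete and the sample records are hypothetical.

```python
MISSING = "?"   # assumption: the marker used for unknown values in the data file


def drop_incomplete(rows, columns=None):
    """Keep only rows with no missing marker in the given columns.

    If columns is None, every field of the row is checked.
    """
    kept = []
    for row in rows:
        fields = columns if columns is not None else list(row.keys())
        if all(row[name] != MISSING for name in fields):
            kept.append(row)
    return kept


# Hypothetical records for illustration only.
sample = [
    {"make": "audi", "num_of_doors": "four", "normalized_losses": "164"},
    {"make": "dodge", "num_of_doors": "?", "normalized_losses": "148"},
    {"make": "isuzu", "num_of_doors": "two", "normalized_losses": "?"},
]

complete = drop_incomplete(sample)
doors_known = drop_incomplete(sample, columns=["num_of_doors"])
print(len(sample), len(complete), len(doors_known))
```

Restricting the check to the columns a given analysis uses keeps more records than checking every field, which matters in a data set of only 205 cars.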