# Enterprise Information Management-54259

Using your knowledge of classification, regression and data exploration, answer the following questions:

• Produce a model to predict the likely normalized-losses of a given vehicle’s specification, risk factor and normalized-losses.

• Produce a model that can classify a vehicle according to its risk factor. That is, given a vehicle’s specification, normalized losses and price, the classifier should produce a class label showing the likely risk factor of that vehicle.

• What are the top five attributes that best determine the risk factor? In listing the top five attributes, discuss how you determined them and supplement with appropriate Orange files if available. Your discussion should be no more than 500 words.

• Can we predict the normalized losses of “alfa-romero” and “isuzu”? If no, explain why. If yes, show how you would do so. Discuss in no more than 500 words.

Solution:

The first task is to construct a regression model that predicts the likely normalized losses of a vehicle. For this purpose, we use an ordinal regression model. The dependent variable is normalized losses, and the independent variables are the make of the car, fuel type, aspiration, number of doors, body style, drive wheels and engine location. The output for this regression model is given below.
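For reference, an ordinal regression with a logit link (the link function reported in the output tables) is commonly written as a cumulative-odds model. The block below is a sketch of that standard form, not output from this analysis; the symbols $\theta_j$, $\beta$, $x$ and $J$ are notation introduced here for illustration.

```latex
\[
\operatorname{logit} P(Y \le j \mid x)
  = \ln \frac{P(Y \le j \mid x)}{1 - P(Y \le j \mid x)}
  = \theta_j - x^{\top}\beta,
  \qquad j = 1, \dots, J - 1
\]
```

Here $Y$ is the ordered category of normalized losses, $x$ is the vector of dummy-coded predictors (make, fuel type and so on), $\theta_j$ are the threshold parameters and $\beta$ are the regression coefficients shared across all thresholds.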

Let us see the model fitting information which is given in the following table:

Model Fitting Information

| Model | -2 Log Likelihood | Chi-Square | df | Sig. |
|---|---|---|---|---|
| Intercept Only | 1339.404 | | | |
| Final | 1224.569 | 114.834 | 32 | .000 |

Link function: Logit.

The p-value for the final model is reported as .000 (that is, p < 0.001), which is below the significance level of 0.05. We therefore reject the null hypothesis that the intercept-only model fits as well as the final model: the independent variables significantly improve the fit.
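The chi-square statistic in the model fitting table is the drop in -2 log likelihood from the intercept-only model to the final model. The following is a minimal sketch that re-derives it from the two reported values and computes the tail probability. The helper `chi2_sf_even_df` is a name introduced here; it relies on the closed form of the chi-square tail that holds only for even degrees of freedom, which applies because df = 32.

```python
import math

# Values copied from the Model Fitting Information table.
NEG2LL_INTERCEPT_ONLY = 1339.404
NEG2LL_FINAL = 1224.569
DF = 32  # degrees of freedom reported for the final model


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable with even df.

    For df = 2m the tail has the closed form
    exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only holds for even, positive df")
    half = x / 2.0
    term = 1.0   # the k = 0 term, (x/2)^0 / 0!
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k  # builds (x/2)^k / k! incrementally
        total += term
    return math.exp(-half) * total


# Likelihood-ratio statistic: difference of the two -2 log likelihoods.
lr_statistic = NEG2LL_INTERCEPT_ONLY - NEG2LL_FINAL
p_value = chi2_sf_even_df(lr_statistic, DF)

print(f"chi-square = {lr_statistic:.3f}, df = {DF}, p = {p_value:.3g}")
```

The difference of the rounded table values is 114.835, which agrees with the reported 114.834 up to rounding, and the tail probability is far below 0.001, consistent with the reported Sig. of .000.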

Now, let us see the test for goodness of fit.

Goodness-of-Fit

| | Chi-Square | df | Sig. |
|---|---|---|---|
| Pearson | 7483.630 | 5527 | .000 |
| Deviance | 1167.756 | 5527 | 1.000 |

Link function: Logit.

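The p-value for the Pearson chi-square test is .000, which is below the significance level of 0.05, so on this statistic we reject the null hypothesis that the model fits the data adequately. The deviance statistic, with a p-value of 1.000, does not reject it. The two statistics disagree, which can happen when many combinations of predictor values contain few or no observations, so neither result should be relied on in isolation.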

The pseudo R-square values are given in the following table:

Pseudo R-Square

| Measure | Value |
|---|---|
| Cox and Snell | .429 |
| Nagelkerke | .429 |
| McFadden | .080 |

Link function: Logit.

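The Cox and Snell and Nagelkerke pseudo R-square values are both 0.429. A pseudo R-square is not a coefficient of determination in the least-squares sense, but read loosely it suggests that the independent variables (make of the car, fuel type, aspiration, number of doors, body style, drive wheels and engine location) account for roughly 43% of the variation in the dependent variable normalized losses. The McFadden value of 0.080 is much lower, so this interpretation should be treated with caution.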
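The Cox and Snell value can be checked by hand from the model fitting chi-square. The sketch below assumes the number of cases is 205, the total shown in the frequency tables; the function name `cox_snell_r2` is introduced here for illustration.

```python
import math

LR_CHI_SQUARE = 114.834  # Chi-Square from the Model Fitting Information table
N_CASES = 205            # assumption: total taken from the frequency tables


def cox_snell_r2(lr_chi_square: float, n_cases: int) -> float:
    """Cox and Snell pseudo R-square: 1 - exp(-LR / n)."""
    return 1.0 - math.exp(-lr_chi_square / n_cases)


r2 = cox_snell_r2(LR_CHI_SQUARE, N_CASES)
print(f"Cox and Snell pseudo R-square = {r2:.3f}")
```

Under that assumption the result rounds to 0.429, matching the reported Cox and Snell value.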

Next, we look at the frequency distributions of some of the variables, starting with symboling. The frequency distribution for this variable is given in the following table:

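| Symboling | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|
| -2.00 | 3 | 1.5 | 1.5 | 1.5 |
| -1.00 | 22 | 10.7 | 10.7 | 12.2 |
| .00 | 67 | 32.7 | 32.7 | 44.9 |
| 1.00 | 54 | 26.3 | 26.3 | 71.2 |
| 2.00 | 32 | 15.6 | 15.6 | 86.8 |
| 3.00 | 27 | 13.2 | 13.2 | 100.0 |
| Total | 205 | 100.0 | 100.0 | |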

On this scale, a value of +3 indicates that the vehicle is risky and a value of -3 means the car is pretty safe. The values observed in this data set range from -2 to +3.

The value 0.00 indicates that the car is neither risky nor safe. This neutral value has a frequency of 67, which is the highest of any category.
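The percent and cumulative percent columns of a frequency table follow directly from the counts. A minimal sketch, using the symboling counts from the table above (the function name `frequency_table` is introduced here for illustration):

```python
# Counts copied from the symboling frequency table.
symboling_counts = {-2: 3, -1: 22, 0: 67, 1: 54, 2: 32, 3: 27}


def frequency_table(counts):
    """Return rows of (value, frequency, percent, cumulative percent)."""
    total = sum(counts.values())
    rows = []
    running = 0
    for value in sorted(counts):
        running += counts[value]
        rows.append((value,
                     counts[value],
                     round(100 * counts[value] / total, 1),
                     round(100 * running / total, 1)))
    return rows


table = frequency_table(symboling_counts)
for value, freq, pct, cum_pct in table:
    print(f"{value:>3} {freq:>4} {pct:>6.1f} {cum_pct:>6.1f}")

# The modal category is the one with the highest frequency.
mode = max(symboling_counts, key=symboling_counts.get)
print("most frequent symboling value:", mode)
```

The same function can be applied to the counts of any of the other categorical variables.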

The frequency distribution for the make of the car is given below:

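| make | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|
| alfa-romero | 3 | 1.5 | 1.5 | 1.5 |
| audi | 7 | 3.4 | 3.4 | 4.9 |
| bmw | 8 | 3.9 | 3.9 | 8.8 |
| chevrolet | 3 | 1.5 | 1.5 | 10.2 |
| dodge | 9 | 4.4 | 4.4 | 14.6 |
| honda | 13 | 6.3 | 6.3 | 21.0 |
| isuzu | 4 | 2.0 | 2.0 | 22.9 |
| jaguar | 3 | 1.5 | 1.5 | 24.4 |
| mazda | 17 | 8.3 | 8.3 | 32.7 |
| mercedes-benz | 8 | 3.9 | 3.9 | 36.6 |
| mercury | 1 | .5 | .5 | 37.1 |
| mitsubishi | 13 | 6.3 | 6.3 | 43.4 |
| nissan | 18 | 8.8 | 8.8 | 52.2 |
| peugot | 11 | 5.4 | 5.4 | 57.6 |
| plymouth | 7 | 3.4 | 3.4 | 61.0 |
| porsche | 5 | 2.4 | 2.4 | 63.4 |
| renault | 2 | 1.0 | 1.0 | 64.4 |
| saab | 6 | 2.9 | 2.9 | 67.3 |
| subaru | 12 | 5.9 | 5.9 | 73.2 |
| toyota | 32 | 15.6 | 15.6 | 88.8 |
| volkswagen | 12 | 5.9 | 5.9 | 94.6 |
| volvo | 11 | 5.4 | 5.4 | 100.0 |
| Total | 205 | 100.0 | 100.0 | |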

Toyota accounts for the largest share of cars, at 15.6% (32 of 205).

The classification for the fuel type of the car is given in the following table:

| fueltype | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|
| diesel | 20 | 9.8 | 9.8 | 9.8 |
| gas | 185 | 90.2 | 90.2 | 100.0 |
| Total | 205 | 100.0 | 100.0 | |

The table for the frequency distribution or classification of the variable aspiration is given as below:

| aspiration | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|
| std | 168 | 82.0 | 82.0 | 82.0 |
| turbo | 37 | 18.0 | 18.0 | 100.0 |
| Total | 205 | 100.0 | 100.0 | |

The frequency distribution for the number of doors for the car is given in the following table:

| num_of_doors | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|
| ? | 2 | 1.0 | 1.0 | 1.0 |
| four | 114 | 55.6 | 55.6 | 56.6 |
| two | 89 | 43.4 | 43.4 | 100.0 |
| Total | 205 | 100.0 | 100.0 | |

The frequency distribution for the body style of the car is given in the following table:

| body_style | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|
| convertible | 6 | 2.9 | 2.9 | 2.9 |
| hardtop | 8 | 3.9 | 3.9 | 6.8 |
| hatchback | 70 | 34.1 | 34.1 | 41.0 |
| sedan | 96 | 46.8 | 46.8 | 87.8 |
| wagon | 25 | 12.2 | 12.2 | 100.0 |
| Total | 205 | 100.0 | 100.0 | |

The frequency distribution for the variable drive wheels is given in the following table:

| drive_wheels | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|
| 4wd | 9 | 4.4 | 4.4 | 4.4 |
| fwd | 120 | 58.5 | 58.5 | 62.9 |
| rwd | 76 | 37.1 | 37.1 | 100.0 |
| Total | 205 | 100.0 | 100.0 | |

The classification for the engine location is given in the following table:

| engine_location | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|
| front | 202 | 98.5 | 98.5 | 98.5 |
| rear | 3 | 1.5 | 1.5 | 100.0 |
| Total | 205 | 100.0 | 100.0 | |

For 202 of the cars the engine is located at the front; only 3 cars have the engine at the rear.

Next, we look at some cross tabulations.

The cross tabulation for the variables symboling and the type of fuel is given in the following table:

symboling * fueltype Crosstabulation (Count)

| symboling | diesel | gas | Total |
|---|---|---|---|
| -2.00 | 0 | 3 | 3 |
| -1.00 | 5 | 17 | 22 |
| .00 | 11 | 56 | 67 |
| 1.00 | 1 | 53 | 54 |
| 2.00 | 3 | 29 | 32 |
| 3.00 | 0 | 27 | 27 |
| Total | 20 | 185 | 205 |

The cross tabulation for the variables symboling and the number of doors for the cars is given in the following table:

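symboling * num_of_doors Crosstabulation (Count)

| symboling | ? | four | two | Total |
|---|---|---|---|---|
| -2.00 | 0 | 3 | 0 | 3 |
| -1.00 | 0 | 22 | 0 | 22 |
| .00 | 1 | 59 | 7 | 67 |
| 1.00 | 1 | 20 | 33 | 54 |
| 2.00 | 0 | 10 | 22 | 32 |
| 3.00 | 0 | 0 | 27 | 27 |
| Total | 2 | 114 | 89 | 205 |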
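A crosstabulation of counts is often easier to read as row percentages. The following sketch converts the symboling * num_of_doors counts above into percentages within each symboling level; the function name `row_percentages` is introduced here for illustration.

```python
# Counts copied from the symboling * num_of_doors crosstabulation
# (columns: "?", "four", "two").
crosstab = {
    -2: {"?": 0, "four": 3, "two": 0},
    -1: {"?": 0, "four": 22, "two": 0},
    0: {"?": 1, "four": 59, "two": 7},
    1: {"?": 1, "four": 20, "two": 33},
    2: {"?": 0, "four": 10, "two": 22},
    3: {"?": 0, "four": 0, "two": 27},
}


def row_percentages(table):
    """Convert each row of counts into percentages of that row's total."""
    result = {}
    for row_label, row in table.items():
        row_total = sum(row.values())
        result[row_label] = {
            column: round(100 * count / row_total, 1)
            for column, count in row.items()
        }
    return result


row_pct = row_percentages(crosstab)
for symboling, row in row_pct.items():
    print(symboling, row)
```

In these counts the share of two-door cars rises from 0% at symboling -2 and -1 to 100% at symboling 3.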

The cross tabulation for the variables symboling and the body style of the car is given in the following table:

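symboling * body_style Crosstabulation (Count)

| symboling | convertible | hardtop | hatchback | sedan | wagon | Total |
|---|---|---|---|---|---|---|
| -2.00 | 0 | 0 | 0 | 3 | 0 | 3 |
| -1.00 | 0 | 0 | 2 | 13 | 7 | 22 |
| .00 | 0 | 1 | 8 | 43 | 15 | 67 |
| 1.00 | 0 | 1 | 27 | 23 | 3 | 54 |
| 2.00 | 1 | 4 | 13 | 14 | 0 | 32 |
| 3.00 | 5 | 2 | 20 | 0 | 0 | 27 |
| Total | 6 | 8 | 70 | 96 | 25 | 205 |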

The cross tabulation for the variables type of fuel and the body style of the car is given in the following table:

fueltype * body_style Crosstabulation (Count)

| fueltype | convertible | hardtop | hatchback | sedan | wagon | Total |
|---|---|---|---|---|---|---|
| diesel | 0 | 1 | 1 | 15 | 3 | 20 |
| gas | 6 | 7 | 69 | 81 | 22 | 185 |
| Total | 6 | 8 | 70 | 96 | 25 | 205 |

Discuss how you would ensure that the model you produced in Task 2 is reliable and accurate.

Here we constructed a regression model to predict the likely normalized losses of a vehicle. We used an ordinal regression model, with normalized losses as the dependent variable and the make of the car, fuel type, aspiration, number of doors, body style, drive wheels and engine location as the independent variables. We chose this model because the values of normalized losses follow a specific order, that is, the dependent variable can be treated as a set of ordered categories.

As noted in the introduction, there are missing values in the data set. Discuss what you would do with these missing values. Do you remove them, attempt to provide values to these unknowns, or attempt a combination of different techniques? Discuss in no more than 500 words.

We observed missing values for some variables in the data set. Missing values become a problem when the data are analysed for further estimation or planning. When we have no information about the missing values, or do not know the pattern behind them, we do not include them in the analysis. We remove the records with missing values before analysis or classification, because including them would give inaccurate estimates in our results.
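Removing records with missing values (listwise deletion) can be sketched as follows. The `?` marker is the one that appears in the num_of_doors frequency table; the records, field names and helper functions below are hypothetical and are not rows from the data set.

```python
MISSING_MARKER = "?"  # marker seen in the num_of_doors frequency table

# Hypothetical records for illustration only; not rows from the data set.
records = [
    {"make": "example-a", "num_of_doors": "four", "normalized_losses": "120"},
    {"make": "example-b", "num_of_doors": "?", "normalized_losses": "95"},
    {"make": "example-c", "num_of_doors": "two", "normalized_losses": "?"},
]


def mark_missing(record):
    """Replace the missing-value marker with None so it is not read as a category."""
    return {key: (None if value == MISSING_MARKER else value)
            for key, value in record.items()}


def drop_incomplete(rows, required_fields):
    """Listwise deletion: keep only rows with a value in every required field."""
    return [row for row in rows
            if all(row[field] is not None for field in required_fields)]


cleaned = [mark_missing(r) for r in records]
complete = drop_incomplete(cleaned, ["num_of_doors", "normalized_losses"])
print(f"kept {len(complete)} of {len(records)} records")
```

Marking `?` as missing before the analysis also prevents it from being counted as a separate category, as happens in the num_of_doors frequency table.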
