Enterprise Information Management-54259

Task 1

Download “data.csv” and “header-description.txt” from CloudDeakin. Produce an Orange loadable file based on your experience in processing the data in the last workshop.

Task 2

Using your knowledge of classification, regression and data exploration, answer the following questions:

• Produce a model to predict the likely normalized-losses of a given vehicle’s specification, risk factor and normalized-losses.

• Produce a model that can classify a vehicle according to its risk factor. That is, given a vehicle’s specification, normalized losses and price, the classifier should produce a class label showing the likely risk factor of that vehicle.

• What are the top five attributes that best determine the risk factor? In listing the top five attributes, discuss how you determined them and supplement with appropriate Orange files if available. Your discussion should be no more than 500 words.

• Can we predict the normalized losses of “alfa-romero” and “isuzu”? If no, explain why. If yes, show how you would do so. Discuss in no more than 500 words.

Solution:

The first task is to build a regression model that predicts the likely normalized losses of a vehicle from its specification. An ordinal regression model is used for this purpose. The dependent variable is normalized losses, and the independent variables are make, fuel type, aspiration, number of doors, body style, drive wheels and engine location. The output of this model is given below.

The model fitting information is given in the following table:

Model Fitting Information

Model            -2 Log Likelihood   Chi-Square   df   Sig.
Intercept Only            1339.404
Final                     1224.569      114.834   32   .000

Link function: Logit.

The p-value for the final model is reported as .000, which is less than the significance level (alpha) of 0.05. We therefore reject the null hypothesis that the intercept-only model fits the data as well as the final model; in other words, the independent variables significantly improve the fit of the ordinal regression model.
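
The chi-square in the Final row is the difference between the two -2 log likelihood values, and the reported significance is the upper-tail chi-square probability at 32 degrees of freedom. The following is a minimal Python sketch of that arithmetic. The helper name chi2_sf_even_df is hypothetical, and it relies on a closed form that is exact only for even degrees of freedom, so it illustrates the calculation rather than reproducing the routine the statistics package uses.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only holds for positive even df")
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    # Sum (x/2)^i / i! for i = 0 .. df/2 - 1
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


# Values copied from the Model Fitting Information table
neg2ll_intercept_only = 1339.404
neg2ll_final = 1224.569
df = 32

lr_chi_square = neg2ll_intercept_only - neg2ll_final
p_value = chi2_sf_even_df(lr_chi_square, df)

print(f"chi-square = {lr_chi_square:.3f}, df = {df}, p = {p_value:.3g}")
```

The difference computed from the rounded table entries agrees with the reported 114.834 up to rounding in the last digit, and the p-value is far below 0.001, which is consistent with the reported .000.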

Now, let us see the test for goodness of fit.

Goodness-of-Fit

           Chi-Square     df    Sig.
Pearson      7483.630   5527    .000
Deviance     1167.756   5527   1.000

Link function: Logit.

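The p-value for the Pearson chi-square statistic is reported as .000, which is less than the significance level (alpha) of 0.05, so on this statistic we reject the null hypothesis that the model fits the data adequately. The deviance statistic, with a p-value of 1.000, does not reject that hypothesis. The two statistics disagree, which is typical when many combinations of predictor values contain few or no observations, so neither result should be relied on by itself.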

The pseudo R-square values are given in the following table:

Pseudo R-Square

Cox and Snell   .429
Nagelkerke      .429
McFadden        .080

Link function: Logit.

The Cox and Snell and Nagelkerke pseudo R-square values are both 0.429, and the McFadden value is 0.080. Pseudo R-square statistics are not a literal proportion of variance explained, but values of this size suggest that make, fuel type, aspiration, number of doors, body style, drive wheels and engine location account for a moderate share of the variation in normalized losses.
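
As a rough check on the table, the Cox and Snell value can be recomputed from the model chi-square as 1 - exp(-chi-square / n). The Python sketch below assumes n = 205, that is, that every car in the data set entered the model. That sample size is an assumption and is not stated in the output.

```python
import math

n_cases = 205            # assumption: all 205 cars entered the model
lr_chi_square = 114.834  # "Final" row of the Model Fitting Information table

# Cox and Snell pseudo R-square expressed through the likelihood-ratio chi-square
cox_snell = 1.0 - math.exp(-lr_chi_square / n_cases)

print(f"Cox and Snell R-square = {cox_snell:.3f}")
```

Under that assumption the result rounds to .429, which matches the Cox and Snell entry in the table.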

Next, we look at the frequency distributions of some of the variables, starting with symboling (the risk factor). The frequency distribution for this variable is given in the following table:

Symboling

Valid   Frequency   Percent   Valid Percent   Cumulative Percent
-2.00           3       1.5             1.5                  1.5
-1.00          22      10.7            10.7                 12.2
.00            67      32.7            32.7                 44.9
1.00           54      26.3            26.3                 71.2
2.00           32      15.6            15.6                 86.8
3.00           27      13.2            13.2                100.0
Total         205     100.0           100.0

On this scale, a value of +3 indicates that the vehicle is risky and a value of -3 indicates that it is pretty safe; the values observed in this data set range from -2 to +3.

The value 0 indicates that the car is neither risky nor safe. This neutral value has the highest frequency: 67 of the 205 cars (32.7%).
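
The Percent and Cumulative Percent columns follow directly from the frequencies. A short Python sketch using the counts from the symboling table (the variable names are illustrative only):

```python
# Frequencies copied from the symboling table
symboling_counts = {-2: 3, -1: 22, 0: 67, 1: 54, 2: 32, 3: 27}

total = sum(symboling_counts.values())
running = 0
rows = []
for value, count in sorted(symboling_counts.items()):
    running += count
    percent = round(100 * count / total, 1)
    cumulative = round(100 * running / total, 1)
    rows.append((value, count, percent, cumulative))
    print(f"{value:>3} {count:>4} {percent:>6.1f} {cumulative:>6.1f}")

# Category with the highest frequency
most_common = max(symboling_counts, key=symboling_counts.get)
print("most frequent value:", most_common)
```

The same loop applies to any of the one-way frequency tables in this section once its counts are substituted.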

The frequency distribution for the make of the car is given in the following table:

make

Valid           Frequency   Percent   Valid Percent   Cumulative Percent
alfa-romero             3       1.5             1.5                  1.5
audi                    7       3.4             3.4                  4.9
bmw                     8       3.9             3.9                  8.8
chevrolet               3       1.5             1.5                 10.2
dodge                   9       4.4             4.4                 14.6
honda                  13       6.3             6.3                 21.0
isuzu                   4       2.0             2.0                 22.9
jaguar                  3       1.5             1.5                 24.4
mazda                  17       8.3             8.3                 32.7
mercedes-benz           8       3.9             3.9                 36.6
mercury                 1        .5              .5                 37.1
mitsubishi             13       6.3             6.3                 43.4
nissan                 18       8.8             8.8                 52.2
peugot                 11       5.4             5.4                 57.6
plymouth                7       3.4             3.4                 61.0
porsche                 5       2.4             2.4                 63.4
renault                 2       1.0             1.0                 64.4
saab                    6       2.9             2.9                 67.3
subaru                 12       5.9             5.9                 73.2
toyota                 32      15.6            15.6                 88.8
volkswagen             12       5.9             5.9                 94.6
volvo                  11       5.4             5.4                100.0
Total                 205     100.0           100.0

Toyota is the most frequent make, with 32 of the 205 cars (15.6%).

The frequency distribution for the fuel type of the car is given in the following table:

fueltype

Valid    Frequency   Percent   Valid Percent   Cumulative Percent
diesel          20       9.8             9.8                  9.8
gas            185      90.2            90.2                100.0
Total          205     100.0           100.0

The frequency distribution for the variable aspiration is given in the following table:

aspiration

Valid   Frequency   Percent   Valid Percent   Cumulative Percent
std           168      82.0            82.0                 82.0
turbo          37      18.0            18.0                100.0
Total         205     100.0           100.0

The frequency distribution for the number of doors for the car is given in the following table:

num_of_doors

Valid   Frequency   Percent   Valid Percent   Cumulative Percent
?               2       1.0             1.0                  1.0
four          114      55.6            55.6                 56.6
two            89      43.4            43.4                100.0
Total         205     100.0           100.0

The frequency distribution for the body style of the car is given in the following table:

body_style

Valid         Frequency   Percent   Valid Percent   Cumulative Percent
convertible           6       2.9             2.9                  2.9
hardtop               8       3.9             3.9                  6.8
hatchback            70      34.1            34.1                 41.0
sedan                96      46.8            46.8                 87.8
wagon                25      12.2            12.2                100.0
Total               205     100.0           100.0

The frequency distribution for the variable drive wheels is given in the following table:

drive_wheels

Valid   Frequency   Percent   Valid Percent   Cumulative Percent
4wd             9       4.4             4.4                  4.4
fwd           120      58.5            58.5                 62.9
rwd            76      37.1            37.1                100.0
Total         205     100.0           100.0

The frequency distribution for the engine location is given in the following table:

engine_location

Valid   Frequency   Percent   Valid Percent   Cumulative Percent
front         202      98.5            98.5                 98.5
rear            3       1.5             1.5                100.0
Total         205     100.0           100.0

For 202 of the cars the engine is located at the front, and for only 3 cars it is located at the rear.

Next, we look at some cross tabulations.

The cross tabulation for the variables symboling and the type of fuel is given in the following table:

symboling * fueltype Crosstabulation

Count

            fueltype
symboling   diesel   gas   Total
-2.00            0     3       3
-1.00            5    17      22
.00             11    56      67
1.00             1    53      54
2.00             3    29      32
3.00             0    27      27
Total           20   185     205
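
The row and column totals of a cross tabulation should agree with the corresponding one-way frequency tables, which gives a quick consistency check. A minimal Python sketch using the counts from the symboling * fueltype cross tabulation (the variable names are illustrative only):

```python
# Cell counts copied from the symboling * fueltype crosstab: (diesel, gas)
crosstab = {
    -2: (0, 3),
    -1: (5, 17),
    0: (11, 56),
    1: (1, 53),
    2: (3, 29),
    3: (0, 27),
}

# Row totals: one per symboling level
row_totals = {level: sum(cells) for level, cells in crosstab.items()}

# Column totals: one per fuel type, in the order (diesel, gas)
column_totals = tuple(sum(column) for column in zip(*crosstab.values()))

grand_total = sum(column_totals)

print(row_totals)
print(column_totals, grand_total)
```

The column totals reproduce the 20 diesel and 185 gas cars of the fuel type frequency table, and the row totals reproduce the symboling frequencies.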

The cross tabulation for the variables symboling and the number of doors for the cars is given in the following table:

symboling * num_of_doors Crosstabulation

Count

            num_of_doors
symboling   ?   four   two   Total
-2.00       0      3     0       3
-1.00       0     22     0      22
.00         1     59     7      67
1.00        1     20    33      54
2.00        0     10    22      32
3.00        0      0    27      27
Total       2    114    89     205

The cross tabulation for the variables symboling and the body style of the car is given in the following table:

symboling * body_style Crosstabulation

Count

            body_style
symboling   convertible   hardtop   hatchback   sedan   wagon   Total
-2.00                 0         0           0       3       0       3
-1.00                 0         0           2      13       7      22
.00                   0         1           8      43      15      67
1.00                  0         1          27      23       3      54
2.00                  1         4          13      14       0      32
3.00                  5         2          20       0       0      27
Total                 6         8          70      96      25     205

The cross tabulation for the variables type of fuel and the body style of the car is given in the following table:

fueltype * body_style Crosstabulation

Count

           body_style
fueltype   convertible   hardtop   hatchback   sedan   wagon   Total
diesel               0         1           1      15       3      20
gas                  6         7          69      81      22     185
Total                6         8          70      96      25     205

Task 3

Discuss how you would ensure that the models you produced in Task 2 are reliable and accurate.

For Task 2 we constructed a regression model to predict the likely normalized losses of a vehicle from its specification, using ordinal regression. The dependent variable is normalized losses, and the independent variables are make, fuel type, aspiration, number of doors, body style, drive wheels and engine location. We chose an ordinal regression model because the values of normalized losses have a natural order and fall within a specific range.
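
One common way to check reliability, offered here as an illustration rather than as the procedure behind the results above, is k-fold cross-validation: the model is fitted on k - 1 parts of the data and scored on the remaining part, and this is repeated for each part in turn. Orange's Test and Score widget offers a comparable cross-validation option. The Python sketch below only produces the index splits; the function name k_fold_indices, the choice of five folds and the seed are assumptions.

```python
import random


def k_fold_indices(n_rows, k, seed=0):
    """Yield (train, test) lists of row indices for k roughly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds, round-robin
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 205 rows, matching the number of cars in the frequency tables
splits = list(k_fold_indices(n_rows=205, k=5))
for train, test in splits:
    print(len(train), len(test))
```

Each row appears in exactly one test fold, so every car is used once for evaluation and never scored by a model that saw it during fitting.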

Task 4

As noted in the introduction, there are missing values in the data set. Discuss what you would do with these missing values. Do you remove them, attempt to provide values to these unknowns, or attempt a combination of different techniques? Discuss in no more than 500 words.

We observed missing values for some of the variables in the data set; for example, the number of doors is recorded as “?” for two cars. Missing values become a problem when the data are analysed for further estimation or planning. When nothing is known about the missing values, or about the pattern in which they occur, we do not include them in the analysis. We remove the records with missing values before analysis or classification, because including them would prevent us from obtaining accurate estimates.
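
The removal step can be sketched in Python as follows. This is a minimal illustration, not the exact procedure used: the sample rows and the column name normalized_losses are hypothetical, and only the “?” marker is taken from the frequency tables above.

```python
import csv
import io

# Hypothetical three-column excerpt; the real data.csv has many more columns.
SAMPLE = """make,num_of_doors,normalized_losses
audi,four,164
mazda,?,?
dodge,two,148
isuzu,four,?
"""


def load_rows(text, missing_marker="?"):
    """Parse CSV text, replacing the missing-value marker with None."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {key: (None if value == missing_marker else value)
         for key, value in row.items()}
        for row in reader
    ]


def drop_incomplete(rows, required):
    """Listwise deletion: keep rows with no missing value in the required columns."""
    return [row for row in rows if all(row[col] is not None for col in required)]


rows = load_rows(SAMPLE)
complete = drop_incomplete(rows, ["num_of_doors", "normalized_losses"])

# Count missing values per column before anything is removed
missing_per_column = {col: sum(row[col] is None for row in rows) for col in rows[0]}

print(missing_per_column)
print([row["make"] for row in complete])
```

Counting the missing values per column first shows how many records the removal will cost before any are discarded.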
