Enterprise Information Management-54259

Task 1

Download “data.csv” and “header-description.txt” from CloudDeakin. Produce an Orange loadable file based on your experience in processing the data in the last workshop.

Task 2

Using your knowledge of classification, regression and data exploration, answer the following questions:

• Produce a model to predict the likely normalized-losses of a given vehicle’s specification, risk factor and normalized-losses.

• Produce a model that can classify a vehicle according to its risk factor. That is, given a vehicle’s specification, normalized losses and price, the classifier should produce a class label showing the likely risk factor of that vehicle.

• What are the top five attributes that best determine the risk factor? In listing the top five attributes, discuss how you determined them and supplement with appropriate Orange files if available. Your discussion should be no more than 500 words.

• Can we predict the normalized losses of “alfa-romero” and “isuzu”? If no, explain why. If yes, show how you would do so. Discuss in no more than 500 words.

Solution:

Here we have to construct a regression model for predicting the likely normalized losses of a given vehicle. For this purpose, we use an ordinal regression model. The dependent variable is normalized losses, and the independent variables are the make of the car, fuel type, aspiration, number of doors, body style, drive wheels and engine location. The output for this regression model is given below:

Let us see the model fitting information which is given in the following table:

Model Fitting Information
Model	-2 Log Likelihood	Chi-Square	df	Sig.
Intercept Only	1339.404
Final	1224.569	114.834	32	.000
Link function: Logit.

The p-value for this model is reported as .000, that is, it is less than the significance level (alpha) of 0.05. We therefore reject the null hypothesis that the final model fits no better than the intercept-only model, and conclude that the independent variables significantly improve the fit of the ordinal regression model.
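The chi-square statistic in the model fitting table is the drop in -2 log likelihood from the intercept-only model to the final model. A minimal sketch of that arithmetic, using the two values from the table (the variable names are illustrative only):

```python
# Likelihood-ratio chi-square: difference in -2 log likelihood between
# the intercept-only model and the final model (values from the table).
neg2ll_intercept_only = 1339.404
neg2ll_final = 1224.569

lr_chi_square = neg2ll_intercept_only - neg2ll_final
print(f"Likelihood-ratio chi-square: {lr_chi_square:.3f}")
```

The result agrees with the reported value of 114.834 up to rounding of the displayed log likelihoods.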

Now, let us see the test for goodness of fit.

Goodness-of-Fit
	Chi-Square	df	Sig.
Pearson	7483.630	5527	.000
Deviance	1167.756	5527	1.000
Link function: Logit.

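The p-value for the Pearson chi-square test is reported as .000, which is less than the significance level of 0.05, so by this test we reject the null hypothesis that the model fits the observed data adequately. The deviance test, however, gives a p-value of 1.000 and does not reject this hypothesis. The two statistics disagree, which commonly happens when categorical predictors leave many cells empty or sparsely filled, so these goodness-of-fit results should be interpreted with caution.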

The values of the pseudo R-square statistics are given in the following table:

Pseudo R-Square
Cox and Snell	.429
Nagelkerke	.429
McFadden	.080
Link function: Logit.

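The Cox and Snell and Nagelkerke pseudo R-square values are both 0.429, while the McFadden value is 0.080. Pseudo R-square statistics are not a coefficient of determination in the strict sense, but read loosely they suggest that the independent variables (make of the car, fuel type, aspiration, number of doors, body style, drive wheels and engine location) account for a moderate share, roughly 43% by the Cox and Snell measure, of the variation in normalized losses.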
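As a cross-check, the Cox and Snell value can be recovered from the model fitting table. The sketch below assumes the standard Cox and Snell formula, 1 - exp(-chi-square / n), and assumes that all 205 vehicles in the data set were used to fit the model; neither assumption is stated in the output above.

```python
import math

lr_chi_square = 114.834  # Chi-Square of the final model (model fitting table)
n_cases = 205            # assumption: every vehicle in the data set was used

# Cox and Snell pseudo R-square under the assumptions stated above.
cox_snell = 1.0 - math.exp(-lr_chi_square / n_cases)
print(f"Cox and Snell pseudo R-square: {cox_snell:.3f}")
```

Under these assumptions the computed value rounds to the .429 shown in the table.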

Now, let us look at the frequency distributions of some of the variables, starting with the variable symboling. The frequency distribution for this variable is given in the following table:

Symboling
		Frequency	Percent	Valid Percent	Cumulative Percent
Valid	-2.00	3	1.5	1.5	1.5
	-1.00	22	10.7	10.7	12.2
	.00	67	32.7	32.7	44.9
	1.00	54	26.3	26.3	71.2
	2.00	32	15.6	15.6	86.8
	3.00	27	13.2	13.2	100.0
	Total	205	100.0	100.0

In this frequency distribution, a value of +3 indicates that the vehicle is risky and a value of -3 indicates that the car is pretty safe; the lowest value observed in this data set is -2.

The value 0.00 indicates that the car is neither risky nor safe. This neutral value has a frequency of 67 (32.7%), which is the highest frequency of any symboling value.
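The Percent and Cumulative Percent columns follow directly from the counts. A short sketch that recomputes them for symboling, with the counts copied from the table above (the variable names are illustrative):

```python
# Symboling counts taken from the frequency table.
counts = {-2: 3, -1: 22, 0: 67, 1: 54, 2: 32, 3: 27}

total = sum(counts.values())
running = 0
rows = []
for value in sorted(counts):
    running += counts[value]
    percent = round(100 * counts[value] / total, 1)
    cumulative = round(100 * running / total, 1)
    rows.append((value, counts[value], percent, cumulative))
    print(f"{value:>3}  {counts[value]:>3}  {percent:>5.1f}  {cumulative:>5.1f}")

# The symboling value with the highest frequency.
most_common = max(counts, key=counts.get)
print("Most frequent symboling value:", most_common)
```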

Now, let us see the frequency distribution for the make of the car:

make
		Frequency	Percent	Valid Percent	Cumulative Percent
Valid	alfa-romero	3	1.5	1.5	1.5
	audi	7	3.4	3.4	4.9
	bmw	8	3.9	3.9	8.8
	chevrolet	3	1.5	1.5	10.2
	dodge	9	4.4	4.4	14.6
	honda	13	6.3	6.3	21.0
	isuzu	4	2.0	2.0	22.9
	jaguar	3	1.5	1.5	24.4
	mazda	17	8.3	8.3	32.7
	mercedes-benz	8	3.9	3.9	36.6
	mercury	1	.5	.5	37.1
	mitsubishi	13	6.3	6.3	43.4
	nissan	18	8.8	8.8	52.2
	peugot	11	5.4	5.4	57.6
	plymouth	7	3.4	3.4	61.0
	porsche	5	2.4	2.4	63.4
	renault	2	1.0	1.0	64.4
	saab	6	2.9	2.9	67.3
	subaru	12	5.9	5.9	73.2
	toyota	32	15.6	15.6	88.8
	volkswagen	12	5.9	5.9	94.6
	volvo	11	5.4	5.4	100.0
	Total	205	100.0	100.0

Toyota has the largest share of cars in the data set, at 15.6%.

The frequency distribution for the fuel type of the car is given in the following table:

fueltype
		Frequency	Percent	Valid Percent	Cumulative Percent
Valid	diesel	20	9.8	9.8	9.8
	gas	185	90.2	90.2	100.0
	Total	205	100.0	100.0

The frequency distribution for the variable aspiration is given in the following table:

aspiration
		Frequency	Percent	Valid Percent	Cumulative Percent
Valid	std	168	82.0	82.0	82.0
	turbo	37	18.0	18.0	100.0
	Total	205	100.0	100.0

The frequency distribution for the number of doors for the car is given in the following table:

num_of_doors
		Frequency	Percent	Valid Percent	Cumulative Percent
Valid	?	2	1.0	1.0	1.0
	four	114	55.6	55.6	56.6
	two	89	43.4	43.4	100.0
	Total	205	100.0	100.0

The frequency distribution for the body style of the car is given in the following table:

body_style
		Frequency	Percent	Valid Percent	Cumulative Percent
Valid	convertible	6	2.9	2.9	2.9
	hardtop	8	3.9	3.9	6.8
	hatchback	70	34.1	34.1	41.0
	sedan	96	46.8	46.8	87.8
	wagon	25	12.2	12.2	100.0
	Total	205	100.0	100.0

The frequency distribution for the variable drive wheels is given in the following table:

drive_wheels
		Frequency	Percent	Valid Percent	Cumulative Percent
Valid	4wd	9	4.4	4.4	4.4
	fwd	120	58.5	58.5	62.9
	rwd	76	37.1	37.1	100.0
	Total	205	100.0	100.0

The frequency distribution for the engine location is given in the following table:

engine_location
		Frequency	Percent	Valid Percent	Cumulative Percent
Valid	front	202	98.5	98.5	98.5
	rear	3	1.5	1.5	100.0
	Total	205	100.0	100.0

For 202 cars the engine is located at the front, and for only 3 cars it is located at the rear.

Now, let us look at some cross tabulations.

The cross tabulation for the variables symboling and the type of fuel is given in the following table:

symboling * fueltype Crosstabulation
Count
		fueltype
		diesel	gas	Total
symboling	-2.00	0	3	3
	-1.00	5	17	22
	.00	11	56	67
	1.00	1	53	54
	2.00	3	29	32
	3.00	0	27	27
Total		20	185	205
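The marginal totals of a cross tabulation are the row and column sums of its cells. A small sketch that recomputes them for the symboling by fuel type table, with the cell counts copied from the table above (the dictionary layout is an illustrative choice):

```python
# Cell counts from the symboling by fueltype table: (diesel, gas) per row.
table = {
    -2: (0, 3),
    -1: (5, 17),
    0: (11, 56),
    1: (1, 53),
    2: (3, 29),
    3: (0, 27),
}

# Row totals: one sum per symboling level.
row_totals = {level: sum(cells) for level, cells in table.items()}
# Column totals: transpose the rows, then sum each fuel type column.
column_totals = [sum(column) for column in zip(*table.values())]
grand_total = sum(column_totals)

print("Row totals:", row_totals)
print("Column totals (diesel, gas):", column_totals)
print("Grand total:", grand_total)
```

The recomputed margins can be compared with the Total row and column of the table as a quick consistency check.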

The cross tabulation for the variables symboling and the number of doors for the cars is given in the following table:

symboling * num_of_doors Crosstabulation
Count
		num_of_doors
		?	four	two	Total
symboling	-2.00	0	3	0	3
	-1.00	0	22	0	22
	.00	1	59	7	67
	1.00	1	20	33	54
	2.00	0	10	22	32
	3.00	0	0	27	27
Total		2	114	89	205

The cross tabulation for the variables symboling and the body style of the car is given in the following table:

symboling * body_style Crosstabulation
Count
		body_style
		convertible	hardtop	hatchback	sedan	wagon	Total
symboling	-2.00	0	0	0	3	0	3
	-1.00	0	0	2	13	7	22
	.00	0	1	8	43	15	67
	1.00	0	1	27	23	3	54
	2.00	1	4	13	14	0	32
	3.00	5	2	20	0	0	27
Total		6	8	70	96	25	205

The cross tabulation for the variables type of fuel and the body style of the car is given in the following table:

fueltype * body_style Crosstabulation
Count
		body_style
		convertible	hardtop	hatchback	sedan	wagon	Total
fueltype	diesel	0	1	1	15	3	20
	gas	6	7	69	81	22	185
Total		6	8	70	96	25	205

Task 3

Discuss how you would ensure that the models you produced in Task 2 are reliable and accurate.

Here we constructed a regression model for predicting the likely normalized losses of a given vehicle. For this purpose, we used an ordinal regression model. The dependent variable is normalized losses, and the independent variables are the make of the car, fuel type, aspiration, number of doors, body style, drive wheels and engine location. We chose this model because the values of normalized losses follow a natural order, that is, the dependent variable takes values within a specific ordered range. The fit of the model was assessed with the model fitting information, the goodness-of-fit tests and the pseudo R-square values reported in Task 2.

Task 4

As noted in the introduction, there are missing values in the data set. Discuss what you would do with these missing values. Do you remove them, attempt to provide values to these unknowns, or attempt a combination of different techniques? Discuss in no more than 500 words.

We observed missing values for some variables in the data set, for example the two cars whose number of doors is recorded as “?”. These missing values become a problem when we analyse the data for further estimation or planning. When we have no information about the missing values, or do not know the pattern in which they occur, we do not consider them in the analysis. We remove the records with missing values before analysis or classification, because including them would prevent us from obtaining accurate estimates.
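The removal described above can be sketched as a simple filter. This is a minimal illustration rather than the procedure actually used: the three records and the field names are hypothetical, and the only detail taken from the data set is that a missing value appears as “?”.

```python
# Hypothetical records for illustration; "?" marks a missing value,
# as in the num_of_doors frequency table.
records = [
    {"make": "audi", "num_of_doors": "four", "normalized_losses": "164"},
    {"make": "dodge", "num_of_doors": "?", "normalized_losses": "148"},
    {"make": "mazda", "num_of_doors": "two", "normalized_losses": "?"},
]

MISSING = "?"


def drop_incomplete(rows):
    """Keep only the rows in which no field equals the missing marker."""
    return [row for row in rows if MISSING not in row.values()]


complete = drop_incomplete(records)
print(f"Kept {len(complete)} of {len(records)} records")
```

Dropping whole records in this way is simple, but every removed row also discards its known values, so the number of records lost is worth reporting alongside the results.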

