Regression Models using Cross Section Data
Use the data set in DATA_ASSIGNMENT, which contains information on the number of medals won by each country in the Olympic Games between 1960 and 1999 and on the characteristics of these countries. Country ID is the country identifier. Year denotes the year in which the Olympic Games were held. Real GDP is the real gross domestic product of a country in millions of dollars. Population is the number of people living in a country, in millions. Total Medals is the sum of gold, silver and bronze medals won by a country. Host Country is a dummy variable that takes the value 1 if the country is hosting the Olympic Games and 0 otherwise. Planned Economy is a dummy variable that takes the value 1 if the country is a planned economy and is not a member of the Soviet Union, and 0 otherwise. Soviet Union Member is a dummy variable that takes the value 1 if the country is a member of the Soviet Union and 0 otherwise.
Questions
- Present the descriptive statistics of the variables Real GDP, Population, Total Medals. Comment on the means and measures of dispersion of the variables.
Solution
- Real Gross Domestic Product (GDP)
The descriptive statistics of Real GDP (in millions of dollars) are given in Table 1 below. The mean, median, variance and standard deviation are 137726.658, 9110, 3.00986E+11 and 548622.1516 respectively. The mean is about 15 times the median and the standard deviation is about four times the mean, so the data are widely dispersed, with a small number of very large economies pulling the mean upwards. The skewness of 8.0229 is positive, indicating that the data are positively skewed (Little, Deboeck and Wu, 2015, p. 35).
Table 1: Descriptive Statistics (GDP-Millions of Dollars)
Descriptive Statistics | (GDP millions of dollars) |
Mean | 137726.658 |
Standard Error | 15492.60937 |
Median | 9110 |
Mode | 1100 |
Standard Deviation | 548622.1516 |
Sample Variance | 3.00986E+11 |
Kurtosis | 76.05248117 |
Skewness | 8.022897218 |
Range | 7279954 |
Minimum | 46 |
Maximum | 7280000 |
Sum | 172709229.2 |
Count | 1254 |
Confidence Level (95.0%) | 30394.31601 |
- Population (Millions of People)
The descriptive statistics of Population (in millions) are given in Table 2 below. The mean, median, variance and standard deviation are 27.53976778, 7.020635128, 8741.575765 and 93.4963944 respectively. The mean is almost four times the median and the standard deviation is more than three times the mean, so population is also widely dispersed across countries. The skewness of 8.6182 is positive, indicating that the data are positively skewed (Little, Deboeck and Wu, 2015, p. 49).
Table 2: Descriptive Statistics of “Population” in millions
Descriptive Statistics (Population in millions) | |
Mean | 27.53976778 |
Standard Error | 2.640256344 |
Median | 7.020635128 |
Mode | 0.02 |
Standard Deviation | 93.4963944 |
Sample Variance | 8741.575765 |
Kurtosis | 86.37505268 |
Skewness | 8.618197335 |
Range | 1219.98504 |
Minimum | 0.01496041 |
Maximum | 1220 |
Sum | 34534.8688 |
Count | 1254 |
Confidence Level (95.0%) | 5.17981082 |
- Total Medals
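The descriptive statistics of the total medals won by a country are given in Table 3 below. The mean, median and standard deviation are 5.07496, 0 and 16.17332 respectively. The median and mode of 0 show that in at least half of the observations a country won no medals, while the maximum is 195, so the standard deviation is more than three times the mean. The skewness of the total medals data is 5.948, a positive value indicating that the data are positively skewed (Malash and El-Khaiary, 2010, p. 21).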
Table 3: Descriptive Statistics of “Total Medals”
Descriptive Statistics (Total Medals) | |
Mean | 5.07496 |
Standard Error | 0.45672 |
Median | 0 |
Mode | 0 |
Standard Deviation | 16.17332 |
Sample Variance | 261.5762 |
Kurtosis | 44.18286 |
Skewness | 5.948003 |
Range | 195 |
Minimum | 0 |
Maximum | 195 |
Sum | 6364 |
Count | 1254 |
Confidence Level (95.0%) | 0.896021 |
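The three tables can be cross-checked against each other, because the mean equals Sum / Count and the standard error equals Standard Deviation / sqrt(Count). The short Python sketch below recomputes both from the figures in Tables 1 to 3; the dictionary layout and the function name `recomputed` are illustrative choices, not part of the assignment.

```python
import math

# Values copied from Tables 1-3 (Excel "Descriptive Statistics" output).
tables = {
    "Real GDP": {"sum": 172709229.2, "count": 1254, "mean": 137726.658,
                 "sd": 548622.1516, "se": 15492.60937},
    "Population": {"sum": 34534.8688, "count": 1254, "mean": 27.53976778,
                   "sd": 93.4963944, "se": 2.640256344},
    "Total Medals": {"sum": 6364, "count": 1254, "mean": 5.07496,
                     "sd": 16.17332, "se": 0.45672},
}

def recomputed(row):
    """Return (mean, standard error) recomputed from sum, count and sd."""
    mean = row["sum"] / row["count"]
    se = row["sd"] / math.sqrt(row["count"])
    return mean, se

for name, row in tables.items():
    mean, se = recomputed(row)
    print(f"{name}: mean={mean:.5f} (table {row['mean']}), "
          f"SE={se:.5f} (table {row['se']})")
```

The recomputed values agree with the tabulated ones up to rounding, which confirms that 2.640256344 in Table 2 is the standard error of Population and not its median.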
- Estimate the following simple regression model of total medals on real GDP.
TotalMedals=β0+ β1realGDP + u
Write down the sample regression function and interpret the coefficient estimates.
Solution
The regression model of the total medals on real GDP is given as;
TotalMedals =β0+ β1realGDP+u
Where, TotalMedals = Dependent variable of the model
β0 = constant term or the y-intercept
realGDP = independent variable
β1 = coefficient of real GDP
u = the error term
Estimating the simple regression model of total medals on real GDP produces the following Excel output;
SUMMARY OUTPUT | ||||||||
Regression Statistics | ||||||||
Multiple R | 0.6445 | |||||||
R Square | 0.41538 | |||||||
Adjusted R Square | 0.414913 | |||||||
Standard Error | 12.37113 | |||||||
Observations | 1254 | |||||||
ANOVA | ||||||||
df | SS | MS | F | Significance F | ||||
Regression | 1 | 136142.9 | 136142.9 | 889.5624 | 4E-148 | |||
Residual | 1252 | 191612.1 | 153.0448 | |||||
Total | 1253 | 327755 | ||||||
Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Lower 95.0% | Upper 95.0% | |
Intercept | 2.458184 | 0.360198 | 6.824527 | 1.37E-11 | 1.751525 | 3.164843 | 1.751525 | 3.164843 |
Real GDP | 1.9E-05 | 6.37E-07 | 29.82553 | 4E-148 | 1.78E-05 | 2.02E-05 | 1.78E-05 | 2.02E-05 |
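Based on the results above, the coefficient estimates are β0 = 2.458184 (constant term) and β1 = 1.9E-05 (coefficient of realGDP).
Thus the sample regression function is; TotalMedals = 2.458184 + 1.9E-05(realGDP), where the fitted equation carries no error term u. The intercept implies that a country with zero real GDP is predicted to win about 2.46 medals. The slope implies that each additional million dollars of real GDP is associated with 1.9E-05 more medals, that is, about 1.9 more medals for every additional 100 billion dollars of real GDP. Both estimates are statistically significant (p-values 1.37E-11 and 4E-148), and real GDP explains about 41.5% of the variation in total medals (R Square = 0.41538).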
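To make the size of the slope easier to read, the fitted line can be evaluated directly. The sketch below uses only the two coefficients from the Excel output; the function name `fitted_medals` and the choice of a 100 billion dollar increment are illustrative assumptions.

```python
# Estimates copied from the Excel output for TotalMedals on Real GDP.
B0 = 2.458184   # intercept
B1 = 1.9e-05    # slope: medals per million dollars of real GDP

def fitted_medals(real_gdp_millions):
    """Fitted total medals for a given real GDP (in millions of dollars)."""
    return B0 + B1 * real_gdp_millions

# Effect of an extra 100 billion dollars (= 100,000 million) of real GDP.
effect_100bn = B1 * 100_000

# OLS with an intercept passes through the sample means, so the fitted value
# at mean real GDP (Table 1) should be close to mean total medals (Table 3).
at_mean_gdp = fitted_medals(137726.658)

print(round(effect_100bn, 2), round(at_mean_gdp, 3))
```

The fitted value at mean real GDP comes out close to the sample mean of 5.07496 medals; the small gap is due to the slope being reported to only two significant figures.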
- Now estimate the following simple regression model with a level-log specification,
TotalMedals=β0+ β1log (realGDP) + u
Solution
Estimating the simple regression model with a level-log specification gives the Excel output below.
SUMMARY OUTPUT | ||||||||
Regression Statistics | ||||||||
Multiple R | 0.48035 | |||||||
R Square | 0.230736 | |||||||
Adjusted R Square | 0.230122 | |||||||
Standard Error | 14.19091 | |||||||
Observations | 1254 | |||||||
ANOVA | ||||||||
df | SS | MS | F | Significance F | ||||
Regression | 1 | 75624.95 | 75624.95 | 375.5302 | 2.26E-73 | |||
Residual | 1252 | 252130 | 201.3818 | |||||
Total | 1253 | 327755 | ||||||
Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Lower 95.0% | Upper 95.0% | |
Intercept | -27.0712 | 1.706567 | -15.863 | 8.89E-52 | -30.4193 | -23.7232 | -30.4193 | -23.7232 |
log(real GDP) | 3.443017 | 0.177671 | 19.3786 | 2.26E-73 | 3.094451 | 3.791583 | 3.094451 | 3.791583 |
Based on the above results, the coefficients are obtained as; β0 = -27.0712, β1 = 3.443017.
Report your regression results in a sample regression function
The regression function of the model with a level-log specification can thus be written as;
TotalMedals = –27.0712 + 3.443017 log (realGDP)
Interpret the estimated coefficient of log (realGDP).What did you expect this coefficient to be before the estimation and is the sign of this estimate what you expect it to be? Provide an explanation.
The estimated coefficient of log (realGDP) is β1 = 3.443017. In a level-log model the coefficient divided by 100 gives the approximate change in the dependent variable for a 1% change in the regressor, so (taking log to be the natural logarithm) a 1% increase in real GDP is associated with about 3.443017/100 ≈ 0.034 more medals. The coefficient is positive, indicating a positive relation between the total medals earned by a country and its real GDP, and it is statistically significant (p-value 2.26E-73). This is the sign I expected before the estimation: countries with a larger real GDP can devote more resources to training athletes and to sports facilities, so total medals should rise with real GDP (Barreto, 2015).
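The level-log interpretation can be checked numerically. The sketch below assumes that log in the model is the natural logarithm (the assignment does not say which base was used) and compares the exact change in fitted medals with the usual coefficient/100 rule of thumb; `medal_change` is an illustrative helper name.

```python
import math

B1_LOG_GDP = 3.443017  # coefficient of log(realGDP) from the level-log output

def medal_change(pct_increase, beta=B1_LOG_GDP):
    """Change in fitted medals when real GDP rises by pct_increase percent.

    Assumes log() in the model is the natural logarithm.
    """
    return beta * math.log(1 + pct_increase / 100)

exact_1pct = medal_change(1)       # exact change for a 1% rise in real GDP
rule_of_thumb = B1_LOG_GDP / 100   # the beta/100 approximation
doubling = medal_change(100)       # change when real GDP doubles

print(exact_1pct, rule_of_thumb, doubling)
```

For small percentage changes the two figures are nearly identical; for large changes such as a doubling of real GDP, the exact formula should be used.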
- A model that relates the total number of medals to the realGDP and population is:
TotalMedals=β0+ β1realGDP+ β2population+u
Report your results in a sample regression function. What can you conclude regarding comparison of the goodness of fit of this regression model versus the regression model in part (ii)?
Solution
SUMMARY OUTPUT | ||||||||
Regression Statistics | ||||||||
Multiple R | 0.660607 | |||||||
R Square | 0.436402 | |||||||
Adjusted R Square | 0.435501 | |||||||
Standard Error | 12.15153 | |||||||
Observations | 1254 | |||||||
ANOVA | ||||||||
df | SS | MS | F | Significance F | ||||
Regression | 2 | 143032.8 | 71516.39 | 484.3327 | 1.7E-156 | |||
Residual | 1251 | 184722.2 | 147.6596 | |||||
Total | 1253 | 327755 | ||||||
Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Lower 95.0% | Upper 95.0% | |
Intercept | 1.911994 | 0.362727 | 5.271159 | 1.6E-07 | 1.200373 | 2.623615 | 1.200373 | 2.623615 |
Real GDP | 1.77E-05 | 6.53E-07 | 27.18071 | 3.1E-128 | 1.65E-05 | 1.9E-05 | 1.65E-05 | 1.9E-05 |
Population | 0.026154 | 0.003829 | 6.830858 | 1.31E-11 | 0.018643 | 0.033666 | 0.018643 | 0.033666 |
Based on the above results, the coefficients are obtained as; β0 = 1.911994, β1 = 1.77E-05 and β2 = 0.026154. The regression function of the model can thus be given as;
TotalMedals = 1.911994 + 1.77E-05 realGDP + 0.026154 Population
The error term u of the model (Reed, Kaplan and Brewer, 2012, p. 54) does not appear in the fitted equation.
With regard to goodness of fit, this regression model fits better than the model in part (ii). Its R Square is 0.436402 (43.6%) compared with 0.41538 (41.5%) in (ii). Because R Square never falls when a regressor is added, the comparison is better made with Adjusted R Square, which also rises, from 0.414913 to 0.435501, and the standard error of the regression falls from 12.37113 to 12.15153.
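Adjusted R Square penalises extra regressors through the formula 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the number of observations and k the number of regressors. As a quick check, the sketch below recomputes it from the R Square values reported for the models in (ii) and (iv); the function name is an illustrative choice.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k regressors (plus intercept)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

adj_ii = adjusted_r2(0.41538, n=1254, k=1)    # model in part (ii)
adj_iv = adjusted_r2(0.436402, n=1254, k=2)   # model in part (iv)

print(round(adj_ii, 6), round(adj_iv, 6))
```

Both values match the Adjusted R Square figures in the Excel output to six decimal places.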
- Now re-estimate the equation in (iv) but using the logs of the independent variables. That is, estimate the model,
TotalMedals =β0+ β1log (realGDP) + β2log (population) +u
Report the results in a sample regression function. Interpret the coefficient of population. Test whether it is statistically significant at 1% level.
Solution
SUMMARY OUTPUT | ||||||||
Regression Statistics | ||||||||
Multiple R | 0.480379 | |||||||
R Square | 0.230764 | |||||||
Adjusted R Square | 0.229534 | |||||||
Standard Error | 14.19632 | |||||||
Observations | 1254 | |||||||
ANOVA | ||||||||
df | SS | MS | F | Significance F | ||||
Regression | 2 | 75633.92 | 37816.96 | 187.6441 | 5.38E-72 | |||
Residual | 1251 | 252121 | 201.5356 | |||||
Total | 1253 | 327755 | ||||||
Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Lower 95.0% | Upper 95.0% | |
Intercept | -27.3506 | 2.160388 | -12.66 | 1.19E-34 | -31.5889 | -23.1122 | -31.5889 | -23.1122 |
log(Real GDP) | 3.48385 | 0.262752 | 13.25909 | 1.22E-37 | 2.968367 | 3.999332 | 2.968367 | 3.999332 |
log(Population) | -0.06139 | 0.290937 | -0.21101 | 0.832918 | -0.63217 | 0.50939 | -0.63217 | 0.50939 |
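Based on the above results, the coefficients are obtained as; β0 = -27.3506, β1 = 3.48385 and β2 = -0.06139. The regression function of the model can thus be written as;
TotalMedals = -27.3506 + 3.48385 log (realGDP) – 0.06139 log (population)
The coefficient of log (population) implies that, holding real GDP fixed, a 1% increase in population is associated with about 0.06139/100 ≈ 0.0006 fewer medals (taking log to be the natural logarithm). To test its significance we take H0: β2 = 0 against H1: β2 ≠ 0. The t statistic is -0.21101 with a p-value of 0.832918. Since the p-value is greater than 0.01, we do not reject H0; the coefficient of log (population) is not statistically significant at the 1% level.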
- Using the estimated model in (v), test whether realGDP has a positive effect on total medals at 1% level of significance.
Using the model TotalMedals = -27.3506 + 3.48385 log (realGDP) – 0.06139 log (population), we perform a one-sided t test on the coefficient of log (realGDP) to decide whether real GDP has a positive effect on total medals at the 1% level of significance (Barati, 2013).
Null hypothesis; β1 = 0
Alternate Hypothesis: β1 > 0
The t statistic of log (realGDP) is 3.48385 / 0.262752 = 13.25909, with a two-sided p-value of 1.22E-37; the one-sided p-value is half of this. With 1251 degrees of freedom the one-tail critical value at the 1% level is about 2.33. Since 13.25909 exceeds 2.33 (equivalently, the p-value is far below the 0.01 significance level), we reject the null hypothesis and conclude that real GDP has a positive effect on total medals earned by a country.
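A one-sided test of a single coefficient compares its t statistic with a one-tail critical value. The sketch below does this for log (realGDP) in the model of part (v), using the standard normal quantile as a large-sample stand-in for the t distribution with 1251 degrees of freedom; that approximation is an assumption of this example, not something taken from the Excel output.

```python
from statistics import NormalDist

# Coefficient and standard error of log(Real GDP) from the output in (v).
coef, se = 3.48385, 0.262752
ALPHA = 0.01

t_stat = coef / se
# One-tail critical value; the normal quantile approximates t with 1251 df.
z_crit = NormalDist().inv_cdf(1 - ALPHA)
reject_h0 = t_stat > z_crit

print(round(t_stat, 3), round(z_crit, 3), reject_h0)
```

With so many degrees of freedom the normal and t critical values differ only in the third decimal place, so the conclusion does not depend on the approximation.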
- Add the variables “planned economy” and “host country” to the level-log equation in (v) and estimate the following model.
TotalMedals=β0+ β1log (realGDP) + β2log (population) + β3plannedeconomy+ β4hostcountry+ u
Solution
SUMMARY OUTPUT | ||||||||
Regression Statistics | ||||||||
Multiple R | 0.544554 | |||||||
R Square | 0.29654 | |||||||
Adjusted R Square | 0.294287 | |||||||
Standard Error | 13.58668 | |||||||
Observations | 1254 | |||||||
ANOVA | ||||||||
df | SS | MS | F | Significance F | ||||
Regression | 4 | 97192.3 | 24298.07 | 131.6271 | 7.43E-94 | |||
Residual | 1249 | 230562.7 | 184.5978 | |||||
Total | 1253 | 327755 | ||||||
Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Lower 95.0% | Upper 95.0% | |
Intercept | -24.7138 | 2.082675 | -11.8664 | 7.5E-31 | -28.7997 | -20.6279 | -28.7997 | -20.6279 |
log(Real GDP) | 3.155271 | 0.253352 | 12.45408 | 1.21E-33 | 2.658227 | 3.652314 | 2.658227 | 3.652314 |
log(Population) | -0.06609 | 0.27944 | -0.2365 | 0.813083 | -0.61431 | 0.482137 | -0.61431 | 0.482137 |
PlannedEconomy | 3.966477 | 3.083846 | 1.286211 | cc | -2.08361 | 10.01657 | -2.08361 | 10.01657 |
HostCountry | 47.10333 | 4.378378 | 10.75817 | 7.09E-26 | 38.51354 | 55.69312 | 38.51354 | 55.69312 |
The model can thus be written as;
TotalMedals = -24.7138 + 3.155271 log (realGDP) – 0.06609 log (population) + 3.966477 plannedeconomy + 47.10333 hostcountry
Test whether planned economy variable and host country variables are individually significant at 1% level?
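To determine whether the two variables are individually significant at the 1% level of significance, we perform a separate two-sided t test on each coefficient.
The null and alternate hypotheses are stated as follows;
Planned economy: H0: β3 = 0 against H1: β3 ≠ 0
Host country: H0: β4 = 0 against H1: β4 ≠ 0
In each case the test statistic is t = (coefficient estimate – 0) / (standard error of the estimate), with 1249 residual degrees of freedom, and H0 is rejected at α = .01 if |t| exceeds the two-tail critical value of about 2.58.
Planned economy: t = 3.966477 / 3.083846 = 1.286211. Since 1.286211 is less than 2.58, we do not reject H0; planned economy is not individually significant at the 1% level. Its p-value is not legible in the regression output, but the 95% confidence interval (-2.08361, 10.01657) also contains zero.
Host country: t = 47.10333 / 4.378378 = 10.75817, with a p-value of 7.09E-26. Since the p-value is less than 0.01, we reject H0; host country is individually significant at the 1% level.
The two-tail critical value can be read from the Excel t-test output below (2.579759 for 1253 degrees of freedom). That tool compares the sample means of two columns and does not test regression coefficients. Although its columns are labelled Planned Economy and Host Country, the first mean and variance are those of Total Medals in Table 3, so its t Stat of 11.09412 is not a test of β3 or β4.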
t-Test: Two-Sample Assuming Unequal Variances | ||
Planned Economy | Host Country | |
Mean | 5.07496 | 0.007974 |
Variance | 261.5762 | 0.007917 |
Observations | 1254 | 1254 |
Hypothesized Mean Difference | 0 | |
df | 1253 | |
t Stat | 11.09412 | |
P(T<=t) one-tail | 1.2E-27 | |
t Critical one-tail | 2.329328 | |
P(T<=t) two-tail | 2.4E-27 | |
t Critical two-tail | 2.579759 |
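The individual significance of the two dummy variables can be read from the regression output of part (vii) by dividing each coefficient by its standard error. The sketch below does so and attaches an approximate two-sided p-value from the standard normal distribution (via `math.erfc`), which is a large-sample assumption; Excel uses the t distribution with 1249 degrees of freedom, so very small p-values will differ from the ones it prints.

```python
import math
from statistics import NormalDist

# (coefficient, standard error) pairs from the output in part (vii).
estimates = {
    "PlannedEconomy": (3.966477, 3.083846),
    "HostCountry": (47.10333, 4.378378),
}
ALPHA = 0.01
z_crit = NormalDist().inv_cdf(1 - ALPHA / 2)  # two-tail critical value

results = {}
for name, (coef, se) in estimates.items():
    t_stat = coef / se
    # Two-sided normal p-value: P(|Z| > |t|) = erfc(|t| / sqrt(2)).
    p_approx = math.erfc(abs(t_stat) / math.sqrt(2))
    results[name] = (t_stat, p_approx, abs(t_stat) > z_crit)

for name, (t_stat, p_approx, significant) in results.items():
    print(f"{name}: t={t_stat:.4f}, approx p={p_approx:.3g}, "
          f"significant at 1%: {significant}")
```

Under these assumptions only the host country dummy is individually significant at the 1% level; the planned economy dummy has an approximate p-value of about 0.2.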
Test if plannedeconomy and hostcountry variables are jointly significant at 5% level?
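We test H0: β3 = 0 and β4 = 0 versus Ha: at least one of β3 and β4 does not equal zero.
The F statistic compares the residual sum of squares of the restricted model in (v), 252121, with that of the unrestricted model in (vii), 230562.7, using the 2 restrictions and the 1249 residual degrees of freedom of the unrestricted model: F = [(252121 – 230562.7) / 2] / [230562.7 / 1249] ≈ 58.39.
The 5% critical value of the F distribution with 2 and 1249 degrees of freedom is about 3.00. Since 58.39 exceeds this value, we reject the null hypothesis and conclude that planned economy and host country are jointly statistically significant at the 5% level.
Excel's two-sample F-test tool gives the output shown in the table below. It reports the ratio of the two sample variances (0.007917 / 0.015707 ≈ 0.504), which tests the equality of the two variances and not the joint significance of the two coefficients.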
F-Test Two-Sample for Variances | ||
Planned Economy | Host Country | |
Mean | 0.007974 | 0.015949 |
Variance | 0.007917 | 0.015707 |
Observations | 1254 | 1254 |
df | 1253 | 1253 |
F | 0.504052 | |
P(F<=f) one-tail | 0 | |
F Critical one-tail | 0.91122 |
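One way to test the joint significance of the two dummy variables is to compare the residual sums of squares of the models in (v) and (vii), which differ only by those two regressors. The sketch below computes that F statistic from the two ANOVA tables. For the p-value and critical value it relies on the closed-form tail of the F distribution that holds when the numerator has exactly 2 degrees of freedom, P(F > x) = (1 + 2x/d2)^(-d2/2); the helper names are illustrative.

```python
# Residual sums of squares copied from the two ANOVA tables.
ssr_restricted = 252121.0     # model in (v): without the two dummies
ssr_unrestricted = 230562.7   # model in (vii): with the two dummies
q = 2                         # number of restrictions being tested
df_resid = 1249               # residual degrees of freedom, model (vii)

f_stat = ((ssr_restricted - ssr_unrestricted) / q) / (ssr_unrestricted / df_resid)

def f_sf_q2(x, d2):
    """P(F > x) for an F distribution with (2, d2) degrees of freedom."""
    return (1 + 2 * x / d2) ** (-d2 / 2)

def f_crit_q2(alpha, d2):
    """Critical value of F(2, d2) at level alpha, obtained by inverting f_sf_q2."""
    return (d2 / 2) * (alpha ** (-2 / d2) - 1)

p_value = f_sf_q2(f_stat, df_resid)
f_crit = f_crit_q2(0.05, df_resid)

print(round(f_stat, 2), round(f_crit, 2), p_value < 0.05)
```

The closed form applies only because two restrictions are tested; for any other number of restrictions a statistical library or F table would be needed.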
Test the overall significance of the model you estimated in part (vii) at 1% level of significance.
We test H0: β1 = β2 = β3 = β4 = 0 versus H1: at least one of β1, β2, β3 and β4 does not equal zero. From the ANOVA table the F-test statistic is 131.6271 with p-value of 7.43E-94.
Since the p-value is less than 0.01, we reject the null hypothesis that all slope parameters are zero; the model is significant overall at the 1% level of significance.
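The overall F statistic can be recovered in two equivalent ways: as the ratio of the regression and residual mean squares, or from R Square as (R²/k) / ((1 - R²)/(n - k - 1)). The sketch below, using figures from the output in part (vii), checks that both routes agree with the value Excel reports.

```python
# Figures copied from the regression output in part (vii).
r2, n, k = 0.29654, 1254, 4
ms_regression, ms_residual = 24298.07, 184.5978

f_from_anova = ms_regression / ms_residual
f_from_r2 = (r2 / k) / ((1 - r2) / (n - k - 1))

print(round(f_from_anova, 3), round(f_from_r2, 3))
```

Both results agree with the reported 131.6271 up to rounding of the inputs.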
- Suppose you want to test whether Soviet Union Member countries win more medals than other countries. Specify a regression model that will enable you to test such a hypothesis using the model in (v) as a base. Report your results in a sample regression function and perform the hypothesis test at 5% level of significance. What would you infer?
SUMMARY OUTPUT | ||||||||
Regression Statistics | ||||||||
Multiple R | 0.544554 | |||||||
R Square | 0.29654 | |||||||
Adjusted R Square | 0.294287 | |||||||
Standard Error | 13.58668 | |||||||
Observations | 1254 | |||||||
ANOVA | ||||||||
df | SS | MS | F | Significance F | ||||
Regression | 4 | 97192.3 | 24298.07 | 131.6271 | 7.43E-94 | |||
Residual | 1249 | 230562.7 | 184.5978 | |||||
Total | 1253 | 327755 | ||||||
Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Lower 95.0% | Upper 95.0% | |
Intercept | -24.7138 | 2.082675 | -11.8664 | 7.5E-31 | -28.7997 | -20.6279 | -28.7997 | -20.6279 |
log(Real GDP) | 3.155271 | 0.253352 | 12.45408 | 1.21E-33 | 2.658227 | 3.652314 | 2.658227 | 3.652314 |
log(Population) | -0.06609 | 0.27944 | -0.2365 | 0.813083 | -0.61431 | 0.482137 | -0.61431 | 0.482137 |
PlannedEconomy | 3.966477 | 3.083846 | 1.286211 | cc | -2.08361 | 10.01657 | -2.08361 | 10.01657 |
HostCountry | 47.10333 | 4.378378 | 10.75817 | 7.09E-26 | 38.51354 | 55.69312 | 38.51354 | 55.69312 |
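Hypothesis testing;
Using the model in (v) as a base, the Soviet Union Member dummy is added as a regressor: TotalMedals = β0 + β1log (realGDP) + β2log (population) + β3sovietunionmember + u. Whether Soviet Union Member countries win more medals is then a one-sided t test on its coefficient: H0: β3 = 0 versus Ha: β3 > 0.
The output reproduced above is the same as that of the model in (vii) and does not include a Soviet Union Member coefficient, so this test cannot be read from it. The test statistic reported for this part is 4.0635 with p-value of 0.1975.
Since the reported p-value is not less than 0.05 we do not reject the null hypothesis at significance level 0.05 (Hilbe, 2009).
On this basis we cannot infer that Soviet Union Member countries win more medals than other countries, holding real GDP and population fixed.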
References
Barati, R., (2013). Application of excel solver for parameter estimation of the nonlinear Muskingum models. KSCE Journal of Civil Engineering, 17(5), pp.1139-1148.
Barreto, H., (2015). Why Excel? The Journal of Economic Education, 46(3), pp.300-309.
Hilbe, J.M., (2009). Logistic regression models. Chapman and Hall/CRC.
Hill, R.C., Griffiths, W.E. and Lim, G.C., (2018). Principles of econometrics. John Wiley & Sons.
Little, T.D., Deboeck, P. and Wu, W., (2015). Longitudinal data analysis. Emerging Trends in the Social and Behavioral Sciences: An Interdisciplinary, Searchable, and Linkable Resource, pp.1-17.
Malash, G.F. and El-Khaiary, M.I., (2010). Piecewise linear regression: A statistical method for the analysis of experimental adsorption data by the intraparticle-diffusion models. Chemical Engineering Journal, 163(3), pp.256-263.
Reed, D.D., Kaplan, B.A. and Brewer, A.T., (2012). A tutorial on the use of Excel 2010 and Excel for Mac 2011 for conducting delay‐discounting analyses. Journal of applied behavior analysis, 45(2), pp.375-386.
Wilson, J.H., Keating, B.P. and Beal, M., (2015). Regression analysis: understanding and building business and economic models using Excel. Business Expert Press.