R Programming Assignment on Iris Dataset: 1257515

Introduction

This is an R Markdown notebook. The dataset in use is the Iris dataset. We execute the code within the notebook, and the output is generated beneath each piece of code. The aim of this report is to present an understanding of the relationship between the independent variables (the four sepal and petal measurements) and the dependent variable, which in this problem is Species.

Correlation Analysis

In this section, we analyze the correlation between the Species (dependent variable) and other variables.

Screening the variables

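# Read the data; stringsAsFactors = TRUE keeps id and Species as factors
# (this was the default before R 4.0 and matches the output below)
iris <- read.csv("C:\\Users\\310187796\\Desktop\\iris_exams.csv", stringsAsFactors = TRUE)
str(iris)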

## 'data.frame':    300 obs. of  6 variables:
##  $ id          : Factor w/ 300 levels "S001","S002",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Sepal.Length: num  4.75 5.07 5.24 5.48 4.9 ...
##  $ Sepal.Width : num  3.3 3.68 3.44 3.96 2.81 ...
##  $ Petal.Length: num  1.44 1.21 1.59 1.53 1.49 ...
##  $ Petal.Width : num  0.235 0.111 0.405 0.272 0.345 ...

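# Keep Species and the four measurements; drop the id column
myvars <- c("Species", "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
iris <- iris[myvars]
# Recode the Species factor as its integer level codes (1 to 3)
iris$Species <- as.numeric(iris$Species)
summary(iris)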

##     Species   Sepal.Length    Sepal.Width     Petal.Length  
##  Min.   :1   Min.   :4.417   Min.   :1.796   Min.   :1.135  
##  1st Qu.:1   1st Qu.:5.209   1st Qu.:2.720   1st Qu.:1.566  
##  Median :2   Median :5.844   Median :2.992   Median :4.228  
##  Mean   :2   Mean   :5.857   Mean   :3.064   Mean   :3.738  
##  3rd Qu.:3   3rd Qu.:6.448   3rd Qu.:3.375   3rd Qu.:5.205  
##  Max.   :3   Max.   :8.478   Max.   :4.810   Max.   :6.955  
##   Petal.Width      
##  Min.   :-0.03371  
##  1st Qu.: 0.30278  
##  Median : 1.28776  
##  Mean   : 1.19830  
##  3rd Qu.: 1.87452  
##  Max.   : 2.62487

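The above output presents the summary data for all the variables in the dataset. Because Species has been recoded as the integers 1 to 3, its mean of 2, minimum of 1 and maximum of 3 describe the class codes rather than a measured quantity. The summary reports no NA counts, so there are no missing data points in the dataset. It cannot by itself rule out outliers, however: the minimum Petal.Width is slightly negative (-0.03371), which is not a physically possible width and deserves a closer look.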
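The checks for missing values and outliers described above can be sketched as follows. This is a minimal illustration, not the author's code: it uses R's built-in `datasets::iris` (150 rows) as a stand-in for `iris_exams.csv`, so its numbers will differ from the report, and the 1.5 x IQR boxplot rule is only one common convention for flagging outliers.

```r
# Assumption: built-in iris stands in for the author's iris_exams.csv.
dat <- datasets::iris

# Count missing values in each column.
missing_per_column <- colSums(is.na(dat))
print(missing_per_column)

# Flag potential outliers in each numeric column with the boxplot rule
# (points beyond 1.5 * IQR from the quartiles).
numeric_cols <- names(dat)[sapply(dat, is.numeric)]
outliers <- lapply(dat[numeric_cols], function(x) boxplot.stats(x)$out)
print(outliers)

# Count physically impossible measurements (zero or negative sizes).
non_positive <- sapply(dat[numeric_cols], function(x) sum(x <= 0))
print(non_positive)
```

Run against the report's own data frame, the last check would surface the negative Petal.Width value seen in the summary.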

Checking for assumptions

In this section we check the assumptions such as linearity, normality and equal variances (homogeneity).

Normality test

hist(iris$Species, xlab = "Species", main = "Histogram for Species", col = "green")

shapiro.test(iris$Species)

##
##  Shapiro-Wilk normality test
##
## data:  iris$Species
## W = 0.79301, p-value < 2.2e-16

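From the output, the p-value is below 0.05, implying that the distribution of the data is significantly different from a normal distribution. In other words, we reject the assumption that Species is normally distributed. This is expected, because Species is a class code that takes only the values 1, 2 and 3.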
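As a hedged illustration of the same test, the sketch below applies `shapiro.test` both to the integer class code and to a continuous measurement within each species. It uses the built-in `datasets::iris` as a stand-in for the report's CSV (an assumption), so the statistics will not match the output above.

```r
# Assumption: built-in iris stands in for the author's iris_exams.csv.
dat <- datasets::iris

# Shapiro-Wilk on the integer class code: a three-valued variable
# cannot follow a normal distribution, so a tiny p-value is expected.
p_species_code <- shapiro.test(as.numeric(dat$Species))$p.value
print(p_species_code)

# Shapiro-Wilk on a continuous measurement, separately for each species.
p_by_species <- tapply(dat$Sepal.Length, dat$Species,
                       function(x) shapiro.test(x)$p.value)
print(round(p_by_species, 4))
```

Testing the measurements within each group is usually the more informative check, since those are the variables whose distribution matters for the later t-test.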

Results of the correlation test

The results of the correlation test are presented below.

cor(iris, method = "pearson", use = "complete.obs")

##                 Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## Species       1.0000000    0.7434159  -0.5169396    0.9509344   0.9625907
## Sepal.Length  0.7434159    1.0000000  -0.2002927    0.8534125   0.7893659
## Sepal.Width  -0.5169396   -0.2002927   1.0000000   -0.5302401  -0.4680819
## Petal.Length  0.9509344    0.8534125  -0.5302401    1.0000000   0.9653060
## Petal.Width   0.9625907    0.7893659  -0.4680819    0.9653060   1.0000000

From the above results, it can be seen that a strong positive relationship exists between Species and petal width (r = 0.9626). There is also a strong positive relationship between Species and petal length (r = 0.9509), and a moderately strong positive relationship between Species and sepal length (r = 0.7434). Lastly, there is a moderate negative relationship between Species and sepal width (r = -0.5169).
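Because the normality test rejected normality for Species, a rank-based correlation is a reasonable cross-check on the Pearson coefficients. The sketch below swaps in Spearman's rho via `cor.test`, which also returns a p-value. It is an illustration on the built-in `datasets::iris` (an assumption standing in for the report's CSV), not a reproduction of the matrix above.

```r
# Assumption: built-in iris stands in for the author's iris_exams.csv.
dat <- datasets::iris
species_code <- as.numeric(dat$Species)  # class codes 1 to 3

# Spearman rank correlation between the class code and petal width.
# exact = FALSE uses the large-sample approximation, which is needed
# here because the data contain many ties.
res <- cor.test(species_code, dat$Petal.Width,
                method = "spearman", exact = FALSE)

print(res$estimate)  # Spearman's rho
print(res$p.value)   # p-value for H0: rho = 0
```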

Regression analysis

The results of the regression analysis are presented below.

mod<-lm(Species~Sepal.Length+Sepal.Width+Petal.Length+Petal.Width, data=iris)
summary(mod)

##
## Call:
## lm(formula = Species ~ Sepal.Length + Sepal.Width + Petal.Length +
##     Petal.Width, data = iris)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.76982 -0.10135  0.01186  0.11046  0.55591
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.38919    0.13470  10.313  < 2e-16 ***
## Sepal.Length -0.21332    0.03919  -5.444 1.10e-07 ***
## Sepal.Width   0.03303    0.03741   0.883    0.378    
## Petal.Length  0.28771    0.04123   6.978 1.98e-11 ***
## Petal.Width   0.57046    0.06495   8.782  < 2e-16 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Residual standard error: 0.1986 on 295 degrees of freedom
## Multiple R-squared:  0.9418, Adjusted R-squared:  0.941
## F-statistic:  1194 on 4 and 295 DF,  p-value: < 2.2e-16

The above results show that all the predictor variables are significant in the model except sepal width, which was found to be insignificant (p = 0.378 > 0.05).
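One way to check whether dropping sepal width costs the model anything is a nested-model F-test. The sketch below is a hedged illustration on the built-in `datasets::iris` (an assumption standing in for the report's CSV); the object names `full`, `reduced` and `cmp` are hypothetical.

```r
# Assumption: built-in iris stands in for the author's iris_exams.csv.
dat <- datasets::iris
dat$Species <- as.numeric(dat$Species)  # class codes 1 to 3, as in the report

# Full model with all four measurements.
full <- lm(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
           data = dat)

# Reduced model: the same formula with Sepal.Width removed.
reduced <- update(full, . ~ . - Sepal.Width)

# F-test comparing the two nested models; a large p-value suggests
# the dropped term does not improve the fit.
cmp <- anova(reduced, full)
print(cmp)
```

For a single dropped term, this F-test gives the same p-value as the t-test on that coefficient in the full model's summary.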

Regression analysis with only significant variables

The results of the regression analysis with only the significant variables are presented below.

mod<-lm(Species~Sepal.Length+Petal.Length+Petal.Width, data=iris)
summary(mod)

##
## Call:
## lm(formula = Species ~ Sepal.Length + Petal.Length + Petal.Width,
##     data = iris)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.75220 -0.09782  0.00526  0.10802  0.55905
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.41843    0.13052  10.868  < 2e-16 ***
## Sepal.Length -0.19067    0.02962  -6.438 4.87e-10 ***
## Petal.Length  0.26356    0.03085   8.544 6.97e-16 ***
## Petal.Width   0.59515    0.05860  10.156  < 2e-16 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Residual standard error: 0.1985 on 296 degrees of freedom
## Multiple R-squared:  0.9417, Adjusted R-squared:  0.9411
## F-statistic:  1593 on 3 and 296 DF,  p-value: < 2.2e-16

From the above output, it is evident that all three predictor variables are significant in the model (p < 0.05). The overall model is also seen to be significant [F(3, 296) = 1593, p < 2.2e-16]. The R-squared value is 0.9417; this means that 94.17% of the variation in the dependent variable (Species, coded 1 to 3) is explained by the three predictor variables in the model. The coefficient of sepal length is -0.1907; this implies that, holding the other predictors constant, a unit increase in the sepal length is expected to result in a decrease in the Species code by 0.1907. The coefficient of petal length is 0.2636; this implies that increasing the petal length by a unit would result in an increase in the Species code by 0.2636. Lastly, the coefficient of petal width is 0.5952; this implies that increasing the petal width by a unit would result in an increase in the Species code by 0.5952.
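The figures quoted in an interpretation like this can be read directly from the fitted object rather than copied by hand, which helps keep the text consistent with the printed output. The following is a minimal sketch on the built-in `datasets::iris` (an assumption standing in for the report's CSV, so the values differ from those above); `fit`, `s`, `f` and `p_model` are hypothetical names.

```r
# Assumption: built-in iris stands in for the author's iris_exams.csv.
dat <- datasets::iris
dat$Species <- as.numeric(dat$Species)  # class codes 1 to 3

fit <- lm(Species ~ Sepal.Length + Petal.Length + Petal.Width, data = dat)
s <- summary(fit)

# fstatistic is a named vector: value, numdf, dendf.
f <- s$fstatistic
p_model <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

# Report the overall test and R-squared straight from the model.
cat(sprintf("F(%d, %d) = %.1f, R-squared = %.4f\n",
            as.integer(f["numdf"]), as.integer(f["dendf"]),
            f["value"], s$r.squared))

# Coefficient table: estimate, standard error, t value, p-value.
print(round(coef(s), 4))
```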

ANOVA analysis

This section presents an ANOVA test with Species as the dependent variable.

anova_one_way <- aov(Species~Sepal.Length, data = iris)
summary(anova_one_way)

##               Df Sum Sq Mean Sq F value Pr(>F)    
## Sepal.Length   1 110.53   110.5   368.2 <2e-16 ***
## Residuals    298  89.47     0.3                   
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

There was a significant effect of sepal length on Species at the p < .05 level [F(1, 298) = 368.2, p < 2e-16].
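A one-way ANOVA is more commonly set up the other way round, with the categorical variable as the grouping factor and the measurement as the response. The sketch below shows that alternative orientation as an illustration only. It is not the analysis run above, and it uses the built-in `datasets::iris` as a stand-in for the report's CSV (an assumption).

```r
# Assumption: built-in iris stands in for the author's iris_exams.csv.
dat <- datasets::iris

# One-way ANOVA: does mean sepal length differ across the species?
# Species stays a factor, so it is treated as a grouping variable.
fit_aov <- aov(Sepal.Length ~ Species, data = dat)

# summary() returns a list; its first element is the ANOVA table.
tab <- summary(fit_aov)[[1]]
print(tab)

# Degrees of freedom: (groups - 1) and (observations - groups).
print(tab$Df)
```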

T-test analysis

This section presents the t-test with sepal length as the dependent variable, restricted to the setosa and versicolor species. We sought to test the following hypotheses.

Null Hypothesis (H0): Sepal.Length does not differ between Species (setosa and versicolor only). That is to say, the mean Sepal.Length values observed for the two species are not statistically different.

Alternative Hypothesis (HA): Sepal.Length differs between Species (setosa and versicolor only). That is to say, the mean Sepal.Length values observed for the two species are in fact different from each other.

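# Re-read the data so that Species is a factor again (it was recoded as numeric above)
iris <- read.csv("C:\\Users\\310187796\\Desktop\\iris_exams.csv", stringsAsFactors = TRUE)
str(iris)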

## 'data.frame':    300 obs. of  6 variables:
##  $ id          : Factor w/ 300 levels "S001","S002",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Sepal.Length: num  4.75 5.07 5.24 5.48 4.9 ...
##  $ Sepal.Width : num  3.3 3.68 3.44 3.96 2.81 ...
##  $ Petal.Length: num  1.44 1.21 1.59 1.53 1.49 ...
##  $ Petal.Width : num  0.235 0.111 0.405 0.272 0.345 ...

# Keep only the setosa and versicolor rows
newiris <- subset(iris, Species == "setosa" | Species == "versicolor")
str(newiris)

## 'data.frame':    200 obs. of  6 variables:
##  $ id          : Factor w/ 300 levels "S001","S002",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Sepal.Length: num  4.75 5.07 5.24 5.48 4.9 ...
##  $ Sepal.Width : num  3.3 3.68 3.44 3.96 2.81 ...
##  $ Petal.Length: num  1.44 1.21 1.59 1.53 1.49 ...
##  $ Petal.Width : num  0.235 0.111 0.405 0.272 0.345 ...

t.test(Sepal.Length~Species, data=newiris)

##
##  Welch Two Sample t-test
##
## data:  Sepal.Length by Species
## t = -14.045, df = 157.12, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.0176746 -0.7667257
## sample estimates:
##     mean in group setosa mean in group versicolor
##                 5.093869                 5.986069

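From the results presented above, it is evident that there is a significant difference in mean sepal length between the setosa and versicolor species (t = -14.045, df = 157.12, p < 2.2e-16). The mean sepal length is 5.094 for setosa and 5.986 for versicolor, and the 95% confidence interval for the difference (-1.018 to -0.767) excludes zero, so we reject the null hypothesis.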
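The decision rule for this test can be sketched in code: compare the p-value with the significance level, and check whether the confidence interval for the difference in means contains zero. This is a minimal illustration on the built-in `datasets::iris` (an assumption standing in for the report's CSV); `alpha`, `reject_h0` and `ci_excludes_zero` are hypothetical names.

```r
# Assumption: built-in iris stands in for the author's iris_exams.csv.
dat <- datasets::iris

# Keep two species and drop the unused third factor level.
two <- droplevels(subset(dat, Species %in% c("setosa", "versicolor")))

# Welch two-sample t-test (the default when var.equal is not set).
tt <- t.test(Sepal.Length ~ Species, data = two)

alpha <- 0.05
# Reject H0 when the p-value falls below the significance level.
reject_h0 <- tt$p.value < alpha
# Equivalent check: the 95% interval for the difference excludes zero.
ci_excludes_zero <- tt$conf.int[1] > 0 || tt$conf.int[2] < 0

print(tt$estimate)  # group means
cat("Reject H0:", reject_h0, "\n")
cat("CI excludes zero:", ci_excludes_zero, "\n")
```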