Classification Tree and Logistic Regressions: 1331739

Introduction

Data science is one of the fields that has emerged with the growth of technology, which has made data from different companies far easier to access. In this paper we discuss three machine learning algorithms: the classification tree, binary logistic regression, and logistic elastic net regression. A classification tree is a type of decision tree; it differs from a regression tree in that its response variable is categorical. A decision tree is an algorithm that resembles an inverted tree: each internal node tests an independent variable, each branch between nodes represents a decision, and each leaf node indicates the response variable (Zheng et al., 2019). Logistic regression uses the logit function to model a categorical dependent variable and is most often applied when the dependent variable has a binary outcome. Logistic elastic net regression combines the ridge and LASSO penalties, with the mixing parameter searched over 0 ≤ α ≤ 1: when alpha is close to zero the elastic net behaves like ridge regression, and when alpha is close to 1 it behaves like the LASSO (Eddama et al., 2019).
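For reference, the mixing can be written out explicitly. The following is the standard glmnet parameterisation of the elastic net penalty, stated as general background rather than drawn from the sources above:

penalty(β) = λ [ α·Σ|βj| + ((1 − α)/2)·Σβj² ]

Setting α = 1 recovers the pure LASSO penalty, α = 0 the pure ridge penalty, and intermediate values blend the two.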

To begin the analysis, we first import the malware sample data into RStudio with read.csv. We then view the structure and dimensions of the data set. Thereafter we remove the specimenId column, since it contributes nothing to the analysis. The data set provided has no missing values, so no further preprocessing is needed. We then split the malware sample data into an 80% training set and a 20% test set (the first 8,000 rows and the remaining 2,000 rows, respectively). We also set the seed using the student number for reproducibility.

## Loading the data set, checking the dimension and structure of the data set

malwaresample = read.csv("malwaresample.csv")

dim(malwaresample)

str(malwaresample)

## removing the specimenid from the data set

malwaresample$specimenId <- NULL

dim(malwaresample)

## set seed using the student number

set.seed(10336051)

## splitting the malwaresample into train and test

train = malwaresample[1:8000,]

test = malwaresample[8001:10000,]
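As an aside, the split above is deterministic (the first 8,000 rows form the training set), so the seed only affects the cross-validation steps later on. If a randomised 80/20 split were preferred, a minimal base-R sketch (same variable names, not part of the original script) would be:

idx <- sample(seq_len(nrow(malwaresample)), size = 0.8 * nrow(malwaresample))

train <- malwaresample[idx, ]

test <- malwaresample[-idx, ]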

We begin by fitting a binary logistic regression on the training malware sample data. The names function lists the variables in the data set. The dependent variable isMalware takes the value Yes when the email is malicious and No when it is legitimate. We set family to binomial because we want to predict the two outcomes defined by the dependent variable. The fitted model has been named logit.

## creating the binary logistic regression

names(malwaresample)

logit = glm(isMalware ~ hasExe + hasUnknown + senderDomainSuffix + hasZip + hasURL +
              headerCount + hasPDF + urlCount + hasDoc + totalEmailSizeBytes,
            data = train, family = "binomial")
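Before moving on, the fitted model would typically be inspected with summary; the odds-ratio line below is an optional extra, not part of the original script:

summary(logit)

exp(coef(logit)) # coefficients expressed as odds ratios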

The full model above may be overfitted, so to guard against this we use recursive feature elimination (RFE). We begin by loading the caret package, which provides the rfeControl function for setting the parameters used by the recursive feature elimination. The RFE model created has been named lmProfile; it models the dependent variable as a function of the other independent features on the training data. Moreover, we set the sizes argument to span all candidate feature-subset sizes.

library(caret)

control <- rfeControl(functions = lrFuncs,

                      method = "repeatedcv", # cross validation

                      repeats = 10,

                      verbose = FALSE) # prevents copious amounts of output from being produced

##Perform RFE specifying the formula instead.

lmProfile <- rfe(isMalware~., #Specifying isMalware as a function of all other variables.

                 data=train[,1:11], #training data set  

                 sizes = c(1:11), rfeControl = control)

The optimal model from the RFE is stored in the fit component, which can be accessed directly so that its coefficients can be examined. Next we use this model to evaluate the test data. Since the test data contain some categorical features, we use the model.matrix function to convert them to numerical dummy variables, remove the intercept column that model.matrix adds, and convert the result to a data frame. We then use the predict function to obtain predictions from the model on the dummy test data, rounded to three decimal places.

#Access the optimal model and show its coefficients

lmProfile$fit

#Summarise the optimal model and show the significance of the features

summary(lmProfile$fit)

#Convert the categorical features to dummy variables, remove the first

#column and convert the whole data matrix to a data frame.

library(dplyr) # provides the %>% pipe used below

dummy.test <- model.matrix(~., data=test)[,-1] %>% data.frame

pred.optmod <- predict(lmProfile$fit, newdata=dummy.test)

pred.optmod %>% round(digits=3) #Show prediction to 3 dp

Evaluated on the test data with a 0.5 threshold, the lmProfile$fit model has a sensitivity of 69.59%, a specificity of 84.75%, and an overall accuracy of 75.4%. Note that the sensitivity and specificity here are computed column-wise from the table below, i.e. as the proportions of predicted-legitimate and predicted-malicious emails that are classified correctly.

table(dummy.test$isMalwareYes, pred.optmod > 0.5)

    FALSE TRUE

  0   858  117

  1   375  650

> ## sensitivity

> (858/(858+375))

[1] 0.6958637

> ##specificity

> (650/(650+117))

[1] 0.8474576

> ## overall accuracy

> (858+650)/2000

[1] 0.754
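For comparison, caret's confusionMatrix computes the conventional row-wise sensitivity and specificity directly from predicted and true classes. A minimal sketch follows; the factor recoding and the positive argument are assumptions, not part of the original script:

pred.class <- factor(pred.optmod > 0.5, levels = c(FALSE, TRUE), labels = c("No", "Yes"))

truth <- factor(dummy.test$isMalwareYes, levels = c(0, 1), labels = c("No", "Yes"))

confusionMatrix(pred.class, truth, positive = "Yes")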

The next model to create is the logistic elastic net regression, for which we first set the seed using the student number. We then apply the trainControl function to specify the resampling method and the number of folds. Thereafter, we use the train function to fit the model on the training data with the glmnet method. The model created has been named hit_elnet.

## creating logistic elastic net regression

set.seed(10336051)

cv_5 = trainControl(method = "cv", number = 5)

hit_elnet = train(

  isMalware ~ ., data = train,

  method = "glmnet",

  trControl = cv_5

)

hit_elnet

hit_elnet_int = train(

  isMalware ~ . ^ 2, data = train,

  method = "glmnet",

  trControl = cv_5,

  tuneLength = 10

)

To optimize the hyperparameters we fit a second model, hit_elnet_int, whose formula isMalware ~ . ^ 2 adds all two-way interactions between the predictors, and we use a larger tuneLength of 10 so that caret evaluates a finer grid of alpha and lambda values, from which we select the best model to use on the test data.

get_best_result = function(caret_fit) {

  best = which(rownames(caret_fit$results) == rownames(caret_fit$bestTune))

  best_result = caret_fit$results[best, ]

  rownames(best_result) = NULL

  best_result

}

> get_best_result(hit_elnet_int)

  alpha      lambda  Accuracy     Kappa AccuracySD    KappaSD

1   0.5 0.001307656 0.7695026 0.5402999 0.01578346 0.03147032

To extract the best of the many models stored in hit_elnet_int, we first define a helper function named get_best_result. Passing the fitted object to this function returns the best model, which has an alpha of 0.5. An alpha of 0.5 means the penalty is an equal mixture of the LASSO and ridge logistic regressions.
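The table ta passed to confusionMatrix below is not created in the code shown. A minimal sketch of how it might be built, assuming the fitted hit_elnet_int model and the test split from above:

pred.elnet <- predict(hit_elnet_int, newdata = test)

ta <- table(test$isMalware, predicted = pred.elnet)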

confusionMatrix(ta)

Confusion Matrix and Statistics

     predicted

       No Yes

  No  858 117

  Yes 345 680

               Accuracy : 0.769           

                 95% CI : (0.7499, 0.7873)

    No Information Rate : 0.6015          

    P-Value [Acc > NIR] : < 2.2e-16       

                  Kappa : 0.5403          

 Mcnemar's Test P-Value : < 2.2e-16

            Sensitivity : 0.7132          

            Specificity : 0.8532          

When evaluated on the test data, the elastic net logistic regression above has an accuracy of 76.9%, a sensitivity of 71.32%, and a specificity of 85.32%.

# building the classification tree with rpart

library(rpart)

penalty.matrix <- matrix(c(0,1,10,0), byrow=TRUE, nrow=2)

tree <- rpart(isMalware~.,

              data=train,

              parms = list(loss = penalty.matrix),

              method = "class")

# choosing the best complexity parameter "cp" to prune the tree

cp.optim <- tree$cptable[which.min(tree$cptable[,"xerror"]),"CP"]

# pruning the tree using the best complexity parameter

tree <- prune(tree, cp=cp.optim)

The next model constructed on the training data is the classification tree, built with the rpart package. We set a penalty matrix with a cost of 10 for misclassifying a malicious email as legitimate, so that the tree prefers to label legitimate emails as malicious rather than let malicious emails through. We then select the best complexity parameter and pass it to the prune function to obtain the final pruned classification tree, named tree.
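The table t passed to confusionMatrix below, with its pred6 column, is likewise not shown being created. A minimal sketch, with scoredata as a placeholder name for the labelled data frame that was actually scored:

pred6 <- predict(tree, newdata = scoredata, type = "class")

t <- table(scoredata$isMalware, pred6)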

confusionMatrix(t)

Confusion Matrix and Statistics

     pred6

         No   Yes

  No      0 42568

  Yes     0  7432

               Accuracy : 0.1486          

                 95% CI : (0.1455, 0.1518)

    No Information Rate : 1               

    P-Value [Acc > NIR] : 1               

                  Kappa : 0               

 Mcnemar’s Test P-Value : <2e-16          

            Sensitivity :     NA          

            Specificity : 0.1486          

The confusion matrix of the classification tree on the test data gives an overall accuracy of 14.86% and a specificity of 14.86%, while the sensitivity is undefined (NA): under the heavy penalty matrix, the pruned tree predicts every email as malicious.

                   Classification tree   Binary regression   Elastic net regression
Overall accuracy   14.86%                75.4%               76.9%
Sensitivity        NA                    69.59%              71.32%
Specificity        14.86%                84.75%              85.32%

The table above shows how the three models performed on the test data. The elastic net regression has the highest accuracy and the highest sensitivity, while the classification tree has the lowest specificity. From the introduction we want a model with high sensitivity, even at the cost of lower specificity, and on that basis we can select the model to use for the investigation.

We now apply the fitted models to the real data set: we load the world data set into RStudio and run each model on it, obtaining the results tabulated below.
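A minimal sketch of this step; the file name world.csv, the presence of a specimenId column, and the dummy coding for the logistic model are assumptions rather than details taken from the original script:

world <- read.csv("world.csv")

world$specimenId <- NULL

## classification tree and elastic net predictions

confusionMatrix(table(world$isMalware, predict(tree, newdata = world, type = "class")))

confusionMatrix(table(world$isMalware, predicted = predict(hit_elnet_int, newdata = world)))

## the binary logistic regression needs the same dummy coding as the test data

dummy.world <- model.matrix(~., data = world)[,-1] %>% data.frame

pred.world <- predict(lmProfile$fit, newdata = dummy.world)

table(dummy.world$isMalwareYes, pred.world > 0.5)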

                   Classification tree   Binary regression   Elastic net regression
Overall accuracy   14.86%                93.66%              74.71%
Sensitivity        NA                    94.71%              95.26%
Specificity        14.86%                85.87%              34.62%

The table above shows how the three models performed on the real world data. The elastic net regression has the highest sensitivity, the binary logistic regression has the highest overall accuracy, and the classification tree has the lowest specificity. I would therefore recommend that the company pick the elastic net regression, since it has the highest sensitivity and a tolerably low specificity.

References

Zheng, Y., Duarte, C. M., Chen, J., Li, D., Lou, Z., & Wu, J. (2019). Remote sensing mapping of macroalgal farms by modifying thresholds in the classification tree. Geocarto International, 34(10), 1098-1108.

Eddama, M. M. R., Fragkos, K. C., Renshaw, S., Aldridge, M., Bough, G., Bonthala, L., … & Cohen, R. (2019). Logistic regression model to predict acute uncomplicated and complicated appendicitis. The Annals of The Royal College of Surgeons of England, 101(2), 107-118.