Introduction
Data science is a field that has emerged with advances in technology, which have made data from many organisations far easier to access. In this paper we discuss three machine learning algorithms: the classification tree, binary logistic regression and logistic elastic net regression. A classification tree is a decision tree whose response is categorical, which is what distinguishes it from a regression tree. A decision tree resembles an inverted tree in which each internal node tests an independent variable, each branch between nodes represents a decision, and each leaf node gives the predicted response (Zheng et al., 2019). Logistic regression uses the logit function to model a categorical dependent variable and is most often applied when the dependent variable has a binary outcome. Logistic elastic net regression combines the ridge and LASSO penalties; the mixing parameter alpha is searched over the range 0 ≤ α ≤ 1, so values of alpha close to zero give a ridge-like model and values close to 1 give a LASSO-like model (Eddama et al., 2019).
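For reference, the penalty minimised by the elastic net, in the form used by the glmnet package that is fitted later via caret, can be written as

penalty(β) = λ [ (1 − α)/2 · ‖β‖₂² + α · ‖β‖₁ ],  with 0 ≤ α ≤ 1,

so that α = 0 gives the pure ridge (L2) penalty and α = 1 the pure LASSO (L1) penalty.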
To begin the analysis, we import the malware sample data into RStudio with read.csv and then inspect the structure and dimensions of the data set. We remove the specimenId column, since an identifier contributes nothing to the analysis. The data set contains no missing values, so no further preprocessing is needed. We then split the malware sample data into an 80% training set and a 20% test set, and set the seed using the student number for reproducibility.
## Loading the data set, checking the dimension and structure of the data set
malwaresample <- read.csv("malwaresample.csv")
dim(malwaresample)
str(malwaresample)
## Removing the specimenId identifier column from the data set
malwaresample$specimenId <- NULL
dim(malwaresample)
## Setting the seed using the student number
set.seed(10336051)
## Splitting the malware sample data into an 80% train and 20% test set
## (the first 8,000 rows form the training set, the remaining 2,000 the test set)
train <- malwaresample[1:8000, ]
test <- malwaresample[8001:10000, ]
We begin by fitting the binary logistic regression on the training portion of the malware sample data. The names function lists the variables in the data set. The dependent variable isMalware takes the value Yes when an email is malicious and No when it is legitimate. We set family = binomial because the dependent variable has two possible outcomes. The fitted model is named logit.
## Creating the binary logistic regression
names(malwaresample)
logit <- glm(isMalware ~ hasExe + hasUnknown + senderDomainSuffix + hasZip + hasURL +
               headerCount + hasPDF + urlCount + hasDoc + totalEmailSizeBytes,
             data = train, family = "binomial")
The model above uses all of the available predictors and may therefore overfit. To guard against this we apply recursive feature elimination (RFE). We begin by loading the caret package, which provides the rfeControl function for setting the parameters of the RFE procedure. The RFE model is named lmProfile; it regresses the dependent variable isMalware on the remaining features of the training data, with the candidate subset sizes running from 1 up to the number of columns.
library(caret)
control <- rfeControl(functions = lrFuncs,   # logistic regression helper functions
                      method = "repeatedcv", # repeated cross-validation
                      repeats = 10,
                      verbose = FALSE)       # prevents copious amounts of output
## Perform RFE, specifying the model as a formula
lmProfile <- rfe(isMalware ~ .,        # isMalware as a function of all other variables
                 data = train[, 1:11], # training data set
                 sizes = c(1:11),
                 rfeControl = control)
The optimal model from the RFE is stored in the fit component, where its coefficients can be inspected. To evaluate this model on the test data, and because the test data contain categorical features, we use the model.matrix function to convert the test data to numeric dummy variables, drop the intercept column it adds, and convert the result to a data frame. We then use the predict function to obtain predictions from the model on the dummy test data, rounded to three decimal places.
# Access the optimal model and show its coefficients
lmProfile$fit
# Summarise the optimal model and show the significance of the features
summary(lmProfile$fit)
# Convert the categorical features to dummy variables, remove the intercept
# column added by model.matrix and convert the result to a data frame
library(dplyr)   # provides the %>% pipe
dummy.test <- model.matrix(~., data = test)[, -1] %>% data.frame
pred.optmod <- predict(lmProfile$fit, newdata = dummy.test)
pred.optmod %>% round(digits = 3)   # show prediction to 3 dp
On the test data, the lmProfile$fit model has a sensitivity of 69.59%, a specificity of 84.75% and an overall accuracy of 75.4%.
table(dummy.test$isMalwareYes, pred.optmod > 0.5)
    FALSE TRUE
  0   858  117
  1   375  650

## sensitivity
858 / (858 + 375)
[1] 0.6958637
## specificity
650 / (650 + 117)
[1] 0.8474576
## overall accuracy
(858 + 650) / 2000
[1] 0.754
The next model is the logistic elastic net regression. We first set the seed using the student number, then use the trainControl function to specify the resampling method and the number of folds. We then call the train function with method = "glmnet" to fit the model on the training data. The fitted model is named hit_elnet.
## Creating the logistic elastic net regression
set.seed(10336051)
cv_5 <- trainControl(method = "cv", number = 5)   # 5-fold cross-validation
hit_elnet <- train(
  isMalware ~ ., data = train,
  method = "glmnet",
  trControl = cv_5
)
hit_elnet
## Larger tuning grid with all pairwise interaction terms
hit_elnet_int <- train(
  isMalware ~ . ^ 2, data = train,
  method = "glmnet",
  trControl = cv_5,
  tuneLength = 10
)
To tune the hyperparameters further, we expand the model formula to include all pairwise interactions between the predictors (isMalware ~ .^2, which adds interaction terms rather than squaring the dependent variable) and use a larger tuneLength of 10, so that many candidate models are evaluated and the best can be selected for the test data.
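As a quick check that the ^2 in the formula adds interaction terms rather than squaring the response, the snippet below uses hasExe and hasURL, two predictors already in the data set, purely for illustration; the exact column names it prints depend on how those variables are coded.

# Illustrative only: y ~ (a + b)^2 expands to the main effects plus the a:b
# interaction term; it does not square the response
colnames(model.matrix(isMalware ~ (hasExe + hasURL)^2, data = train))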
## Helper to extract the results row corresponding to the best tuning parameters
get_best_result <- function(caret_fit) {
  best <- which(rownames(caret_fit$results) == rownames(caret_fit$bestTune))
  best_result <- caret_fit$results[best, ]
  rownames(best_result) <- NULL
  best_result
}

get_best_result(hit_elnet_int)
  alpha      lambda  Accuracy     Kappa AccuracySD    KappaSD
1   0.5 0.001307656 0.7695026 0.5402999 0.01578346 0.03147032
To extract the best of the candidate models stored in hit_elnet_int, we define a helper function named get_best_result. Passing the fitted object to this function returns the best model, which has alpha = 0.5; this value indicates that the selected model is an even mixture of the LASSO and ridge logistic regressions.
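The table ta passed to confusionMatrix below is not constructed in the code shown above. A minimal sketch of how it could be built, assuming class predictions from hit_elnet_int on the test set are cross-tabulated against the true labels (the object name pred.elnet is illustrative), is:

# Assumed step: predict classes on the test set and cross-tabulate them
# against the true labels before calling confusionMatrix(ta)
pred.elnet <- predict(hit_elnet_int, newdata = test)
ta <- table(predicted = pred.elnet, actual = test$isMalware)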
confusionMatrix(ta)
Confusion Matrix and Statistics
predicted
No Yes
No 858 117
Yes 345 680
Accuracy : 0.769
95% CI : (0.7499, 0.7873)
No Information Rate : 0.6015
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5403
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.7132
Specificity : 0.8532
When evaluated on the test data, the elastic net logistic regression has an overall accuracy of 76.9%, a sensitivity of 71.32% and a specificity of 85.32%.
# building the classification tree with rpart
library(rpart)
# Penalty matrix: misclassifying a malicious email as legitimate costs 10,
# misclassifying a legitimate email as malicious costs 1
penalty.matrix <- matrix(c(0, 1, 10, 0), byrow = TRUE, nrow = 2)
tree <- rpart(isMalware ~ .,
              data = train,
              parms = list(loss = penalty.matrix),
              method = "class")
# Choosing the best complexity parameter "cp" to prune the tree
cp.optim <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
# Pruning the tree using the best complexity parameter
tree <- prune(tree, cp = cp.optim)
The next model, built on the training data with the rpart package, is the classification tree. We set a penalty of 10 in penalty.matrix so that misclassifying a malicious email as legitimate is penalised ten times as heavily as the reverse, pushing the tree towards classifying more legitimate emails as malicious rather than malicious emails as legitimate. We then select the best complexity parameter and pass it to the prune function, which yields the final pruned classification tree, named tree.
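The table t used in the confusion matrix below is likewise not constructed in the code shown. A minimal sketch, assuming class predictions from the pruned tree are tabulated against the true labels, is given here; the object names are illustrative, and note that t masks the base R transpose function, so a different name would normally be preferred.

# Assumed step: predict classes with the pruned tree and cross-tabulate them
# against the true labels before calling confusionMatrix(t)
pred.tree <- predict(tree, newdata = test, type = "class")
t <- table(predicted = pred.tree, actual = test$isMalware)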
confusionMatrix(t)
Confusion Matrix and Statistics
pred6
No Yes
No 0 42568
Yes 0 7432
Accuracy : 0.1486
95% CI : (0.1455, 0.1518)
No Information Rate : 1
P-Value [Acc > NIR] : 1
Kappa : 0
Mcnemar’s Test P-Value : <2e-16
Sensitivity : NA
Specificity : 0.1486
The confusion matrix of the classification tree on the test data gives an overall accuracy of 14.86%; the sensitivity is not given and the specificity is 14.86%.
                   Classification tree   Binary logistic regression   Elastic net regression
Overall accuracy   14.86%                75.4%                        76.9%
Sensitivity        NA                    69.59%                       71.32%
Specificity        14.86%                84.75%                       85.32%
The table above shows how the three models performed on the test data. The elastic net regression has the highest accuracy and the highest sensitivity, while the classification tree has the lowest specificity. As motivated in the introduction, we want a model with high sensitivity, even at the cost of lower specificity, and on the basis of these results we can select the model to use for the investigation.
We now apply the fitted models to the real data: we load the world data set into RStudio and run each of the models on it.
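A minimal sketch of this evaluation step, assuming the world data are supplied as a CSV file with the same columns as the malware sample data (the file name world.csv and the object names are illustrative), is:

# Assumed step: load the real-world data and score it with each fitted model
world <- read.csv("world.csv")
# Binary logistic regression selected by RFE (requires the dummy-coded features)
dummy.world <- model.matrix(~., data = world)[, -1] %>% data.frame
pred.logit.world <- predict(lmProfile$fit, newdata = dummy.world)
# Logistic elastic net regression
pred.elnet.world <- predict(hit_elnet_int, newdata = world)
# Pruned classification tree
pred.tree.world <- predict(tree, newdata = world, type = "class")

The results of this evaluation are summarised in the table below.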
                   Classification tree   Binary logistic regression   Elastic net regression
Overall accuracy   14.86%                93.66%                       74.71%
Sensitivity        NA                    94.71%                       95.26%
Specificity        14.86%                85.87%                       34.62%
The table above shows how the three models performed on the real-world data. The elastic net regression has the highest sensitivity, the binary logistic regression has the highest overall accuracy, and the classification tree has the lowest specificity. I would therefore recommend that the company use the elastic net regression, since it has the highest sensitivity, even though its specificity is lower.
References
Zheng, Y., Duarte, C. M., Chen, J., Li, D., Lou, Z., & Wu, J. (2019). Remote sensing mapping of macroalgal farms by modifying thresholds in the classification tree. Geocarto International, 34(10), 1098-1108.
Eddama, M. M. R., Fragkos, K. C., Renshaw, S., Aldridge, M., Bough, G., Bonthala, L., … & Cohen, R. (2019). Logistic regression model to predict acute uncomplicated and complicated appendicitis. The Annals of The Royal College of Surgeons of England, 101(2), 107-118.