Advanced Analytics Technique: 1253095

This report creates a business context that can be addressed with an advanced analytics technique. The business context could draw on a variety of advanced analytics options such as data mining, machine learning, and artificial intelligence. The chosen technique was to be applied with two tools and the results of the two tools compared. The method selected was logistic regression, and the model was built in both Python and R. Python produced a 100 % accurate model while R produced a 75 % accurate model; Python was therefore the preferred tool for this machine learning task.

Introduction

Analytics is one of the emerging techniques that businesses have used to improve their performance. Advanced analytics applies the concepts of data science to create models that companies can use for decision making. It is a subset of data science that uses high-level techniques and environments to project future trends, events, and behaviors, giving businesses predictive capability over their activities. One of the main concepts of advanced analytics is data mining, which provides the raw data used in both predictive analytics and big data. Advanced analytics finds insights in the available data by applying the appropriate techniques. Predictive analytics, a type of advanced analytics, extracts insights to extrapolate and make predictions about future activities, trends, and customer behavior. Advanced analytics has adopted new data science technologies such as machine learning, artificial intelligence, visualization, semantic analysis, and neural networks; combined, these techniques help create accurate and reliable predictions. Advanced analytics can be deployed by manufacturers and marketing groups: marketing groups have used it to build campaign strategies, and inventory and warehousing managers have used it to compare current and previous sales. They are therefore able to track their performance, which helps them build concrete strategies to avoid wastage and increase sales.

Business Context 

The business context this report focuses on is the health sector. The report aims to use a machine learning technique to classify whether patients have heart disease or not. The data used for this model is the heart data set obtained from the Kaggle website (https://www.kaggle.com/ronitf/heart-disease-uci). It contains about 303 observations, with 13 predictor columns and a target column, and the model classifies the patients in this data to detect whether they have heart disease. The predictors are the patients' age; gender; the intensity of chest pain; resting blood pressure; cholesterol measurement; whether fasting blood sugar was elevated; the resting electrocardiographic measurement; the maximum heart rate; whether the patient had exercise-induced angina; ST depression; the slope of peak exercise; the number of major vessels; and a thalassemia blood-disorder indicator. The target column records the presence of heart disease. To achieve the objective of the report, a logistic regression model was used. Logistic regression is a supervised machine learning technique that uses training data to create a model that can predict and classify the target variable. Several tools can handle such machine learning problems, including SAS, R, Python, RapidMiner, and SPSS Modeler; here, the R and Python environments were used to create the model [20].

Logistic Regression 

Logistic regression is a machine learning technique that handles both regression and classification problems. It is essentially a form of linear regression whose target variable is binary: where other linear regression techniques use a numeric target variable, logistic regression uses a binary one [1]. It can determine the association between the outcome and the predictor variables, and a model can be created to predict the outcome variable from the predictors. The target variable takes values such as 1 or 0, yes or no, or high or low. The model can accommodate multiple predictor variables [2], and the sample size should be large enough to make the model accurate. Unlike linear regression, logistic regression is harder to interpret: it uses probability to estimate the effect of the independent variables on the outcome variable [3]. Classification can also be performed with logistic regression, which is one of several classification techniques in machine learning [19]. Before a classification model is built, the data is partitioned into training and testing sets in some proportion [4]; the proportion most commonly used is 80 % for the training data and 20 % for the testing data. A larger share is given to the training data to increase the accuracy of the model [5], because the training data is what the model is built from [6]. Model accuracy can then be assessed on both the training and the testing data. Logistic regression was chosen here because it has been shown to produce more accurate models than other classification techniques such as decision trees [7]. Accuracy is vital in any model because it indicates whether the created model is valid.
The logistic regression model is relevant in this context because the dependent variable is binary, with values 0 and 1 [18]. Using the remaining variables, logistic regression builds a model that classifies each observation as either 1 or 0 [8]. This means that, from the features of the patients, logistic regression builds a model that can classify a patient as having heart disease or not, with an accuracy that can reasonably be relied on.
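The 80/20 split and logistic regression fit described above can be sketched with scikit-learn. This is a minimal sketch on a synthetic binary data set of the same shape as the heart data; the real analysis would load the Kaggle CSV instead.

```python
# Sketch of the 80/20 train/test split and logistic regression fit
# described above, using synthetic data in place of the real heart data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 303 rows, 13 predictors, binary target (0/1),
# mirroring the shape of the heart data set.
X, y = make_classification(n_samples=303, n_features=13, random_state=42)

# 80 % of the rows train the model; the held-out 20 % test it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Accuracy on both partitions, as discussed in the text.
print(round(model.score(X_train, y_train), 2))
print(round(model.score(X_test, y_test), 2))
```

The training score will usually exceed the test score slightly, which is why the held-out test partition is the honest measure of accuracy.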

R and Python environment

R and Python are among the best tools available in the field of advanced analytics. They can handle machine learning, analytics, and artificial intelligence problems, which is why they are the most popular tools among data scientists. Unlike most other tools, R and Python are open source, so one does not need to purchase them before using them.

Python

Python is structured as a general-purpose programming language, while R was created for statistical analysis [11]. Python was developed in the late 1980s and is one of the most widely used languages among developers and large technology companies such as Google; it has also been used in the YouTube application, Dropbox, and elsewhere. Python is used across IT businesses for multiple purposes and supports numerous networking and AI packages. One of its advantages is that, as a general-purpose language, it can be used in sectors that do not involve statistics at all, including application and web development. Python is also easy to learn, with extensive documentation to support the learning process [12], and because there are many developers with strong Python skills, help is easy to find. Python has numerous libraries for gathering and manipulating data, as well as libraries for data mining, machine learning, and artificial intelligence problems. The package most widely used for machine learning is Scikit-learn, while Pandas is widely used to manipulate data. Python also integrates better than R and boosts productivity: its syntax is exceptionally readable compared with other programming languages, which raises the productivity of development teams. One disadvantage of Python for statistics is that it has fewer statistical packages than R.
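As a small illustration of the Pandas data manipulation mentioned above, the sketch below uses a tiny inline frame as a stand-in for the real heart data; the column names are assumptions modeled on the Kaggle file, not the actual data.

```python
# Sketch of the kind of data manipulation Pandas enables; the inline
# frame and column names are stand-ins, not the actual heart data.
import pandas as pd

df = pd.DataFrame({
    "age":    [63, 37, 41, 56],
    "sex":    [1, 1, 0, 1],
    "chol":   [233, 250, 204, 236],
    "target": [1, 1, 1, 0],
})

# Count missing values: Python models require these to be handled
# before fitting, unlike R's base glm(), which drops them by default.
print(df.isna().sum().sum())

# Simple manipulation: mean cholesterol by disease status.
print(df.groupby("target")["chol"].mean())
```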

R programming 

R is a statistical tool developed by statisticians, and it can be used to make statistical predictions. The language embodies the mathematical concepts from statistics that are used in machine learning [9]. R is the best statistical tool for analytical projects that are heavily based on statistics, and it can also handle machine learning and artificial intelligence projects [10]. One of the benefits of R is that it is well suited to analysis: when a project requires data analysis or visualization, R is considered the best tool. R also allows rapid prototyping and can be used to design machine learning models. One benefit of R over Python is its bulk of useful libraries and tools; its many packages make machine learning possible and improve model performance, with packages such as caret boosting its machine learning capabilities. Developers who use R benefit from the numerous packages that ease analysis, several of which can be used for data manipulation and visualization. For instance, the dplyr package lets developers manipulate data quickly and efficiently, which is one advantage R developers have over Python developers. R is also the most suitable tool for exploratory work: exploratory data analysis (EDA) is a statistical step that should always precede data modeling, and R makes it easier because it can be achieved with just a few lines of code. Python is easy to learn with an accessible learning curve, but R is more challenging and has less support from other developers. R can also be inconsistent: its packages require learning and often lack detailed documentation, which may slow development when using the tool.

Both R and Python have their advantages and disadvantages; the right tool depends on the project being carried out. R performs well in projects that require data manipulation and repetitive tasks, and both tools handle machine learning well and efficiently.

Results

The logistic regression technique was applied in both environments, R and Python. R does not require any extra package to create the model: it uses its base stats package (the glm function), unlike Python, which requires the LogisticRegression class from the sklearn library. R can create the logistic regression model even when there are missing values, whereas in Python the missing values must be handled before building the model, or the model will not be created. Logistic regression extends the normal regression equation by modeling the log odds of the outcome, i.e.

log(p / (1 - p)) = b0 + b1*x1 + ... + bk*xk

where b0 is the intercept and b1, ..., bk are the coefficients of the predictor variables x1, ..., xk. The regression coefficients are usually reported as odds ratios, exp(bi).
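As a worked numeric example of the logistic regression equation and odds ratio discussed above, the short script below uses hypothetical coefficients (not fitted from the heart data) to turn a predictor value into a probability.

```python
# Worked example of the logit equation: hypothetical coefficients b0, b1
# (not fitted from the heart data) map a predictor value to a probability,
# and exp(b1) gives the odds ratio.
import math

b0, b1 = -1.5, 0.8   # hypothetical intercept and slope
x1 = 2.0             # hypothetical predictor value

log_odds = b0 + b1 * x1            # the linear predictor
p = 1 / (1 + math.exp(-log_odds))  # inverse logit gives the probability
odds_ratio = math.exp(b1)          # multiplicative change in odds per unit of x1

print(round(p, 3))           # 0.525
print(round(odds_ratio, 3))  # 2.226
```

The odds ratio exp(b1) is what R and Python report for interpretation: here each one-unit increase in x1 multiplies the odds of the outcome by about 2.23.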

Results from R environment 

The results obtained from the logistic regression in R show that males had lower chances of having heart disease than females. Patients with asymptomatic chest pain had the highest probability of having heart disease, while patients with typical angina chest pain had the lowest probability compared with the other types of chest pain. Patients with the highest resting blood pressure had lower chances of having heart disease than patients with the lowest resting blood pressure. The higher a patient's cholesterol level, the lower the probability of heart disease. Patients with elevated fasting blood sugar had a higher probability of heart disease than those without. The higher the resting electrocardiographic measurement, the higher the patient's chances of heart disease, and the higher the patient's heart rate, the higher the probability of heart disease. The presence of exercise-induced angina reduced the chances of heart disease, as did ST depression induced by exercise relative to rest. The slope of the peak exercise ST segment also affected the possibility of heart disease: a downsloping peak carried the highest probability compared with upsloping and flat. The higher the number of major vessels, the lower the patient's chances of heart disease. Patients with the reversible-defect thalassemia blood disorder had the lowest probability of heart disease compared with those with the normal thalassemia classification. The p-value is used to assess the significance of the independent variables for the dependent variable: it shows whether each variable has a significant effect on the outcome variable.
It is not always easy to determine which variables best explain the probability of the target variable; the p-value makes this possible. The p-value shows the significance of the overall model and also the significance of each individual variable for the dependent variable. According to the model created in R, only seven variables showed a significant effect on the target variable: the sex of the patient, the chest pain experienced, the patient's maximum heart rate, exercise-induced angina, ST depression induced by exercise relative to rest, the number of major vessels, and the thalassemia blood disorder.

Results from Python

The results obtained from Python showed that patients with non-anginal pain had the highest possibility of having heart disease, holding asymptomatic chest pain constant. A patient with elevated fasting blood sugar had a higher probability of heart disease than patients without. Patients with a normal resting electrocardiographic measurement had the highest probability of heart disease compared with the others, and patients with the fixed-defect thalassemia blood disorder had the highest probability compared with the other thalassemia categories. Older patients had a higher possibility of heart disease than young patients. The higher a patient's resting blood pressure, the lower the possibility of heart disease, and the higher the patient's cholesterol measurement, the lower the chances of heart disease. The higher the patient's heart rate, the higher the probability of heart disease. The model also showed that only three variables had a significant effect on the target variable [24], including the chest pain experienced and the number of major vessels. All the significant variables produced by Python were also significant in the model produced with R; however, Python shows the significance of every level of the categorical variables. The output obtained from Python also shows the 95 % confidence interval for every variable in the model, something not obtained from R [21]. The overall model showed a significant effect on the target variable, i.e., p = 4.36e-34 < 0.05, meaning that the independent variables had a significant effect on the target variable [20]. Python therefore produces more information about the model than R. Since the model was created using the training data, it should now be evaluated.
Evaluating the model entails several approaches. The first approach considered during evaluation is the confusion matrix [13], which tabulates the true negatives and true positives [22]; in short, it compares the original labels with what the model predicted. The confusion matrix is crucial in the field of machine learning: it shows the level of error of the predictions of the created model [14] and the performance of the classification model on both the test and training data [23]. It lets the performance of the created model be visualized and identifies the confusion between the classes; the confusion matrix thus measures model performance by exposing the misplaced classifications [15]. Sensitivity measures the proportion of actual positive cases that were correctly predicted as true positives; it is also known as recall [17]. The remaining proportion of actual positive cases is predicted to be negative, so the sum of sensitivity and the false-negative rate is 1. Specificity is the proportion of actual negatives that were predicted to be negative; the remaining proportion is falsely predicted to be positive, so the sum of specificity and the false-positive rate is always 1 [16].
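The confusion-matrix metrics just described can be sketched as follows; the label vectors are small made-up examples, not the actual model output.

```python
# Sketch of the confusion-matrix metrics described above, computed from
# small made-up label vectors rather than the actual model output.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)   # recall: true positives among actual positives
specificity = tn / (tn + fp)   # true negatives among actual negatives

# Each pair sums to 1, as noted in the text.
print(sensitivity + fn / (tp + fn))   # sensitivity + false-negative rate
print(specificity + fp / (tn + fp))  # specificity + false-positive rate
```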

Model Validation 

The confusion matrix obtained from R showed that, of the 26 observations without heart disease, 21 were predicted correctly and five were misclassified. Of the 30 observations with heart disease, 21 were predicted correctly and nine were misclassified. The accuracy of the model was therefore about 75 %. The specificity score for the logistic regression in R was 0.7 (70 %), while the sensitivity score was 0.8077 (80.77 %).
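The figures reported above can be recomputed directly from the stated cell counts; the variable names below simply follow the report's usage of sensitivity and specificity.

```python
# Arithmetic check of the R model's reported figures: 21 of 26 no-disease
# cases and 21 of 30 disease cases were predicted correctly, out of 56.
correct_no_disease, total_no_disease = 21, 26
correct_disease, total_disease = 21, 30

accuracy = (correct_no_disease + correct_disease) / (total_no_disease + total_disease)
sensitivity = correct_no_disease / total_no_disease  # as reported: 21/26
specificity = correct_disease / total_disease        # as reported: 21/30

print(round(accuracy, 2))      # 0.75
print(round(sensitivity, 4))   # 0.8077
print(round(specificity, 4))   # 0.7
```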

The confusion matrix obtained from the logistic regression in Python showed that all 27 observations without heart disease were predicted correctly, and all 34 observations with heart disease were correctly predicted to have heart disease; the model therefore showed 100 % accuracy, and the precision, recall, and f1-score were all 1 (100 %). From these results, Python produced a more accurate model than R: Python trained the data and produced a model that gave 100 % accurate predictions, while the model produced by R gave 75 % accurate predictions on the heart data [23].

Conclusion 

Both Python and R are efficient environments for advanced analytics problems, and both have the packages needed to produce a machine learning model. The model here was created to classify patients with and without heart disease. Creating a model that can classify patients with heart disease is very important in the field of health, because it allows more attention to be placed on the patients with the relevant features.

Recommendations

Patients presented at a hospital come with different symptoms. A model that can check whether a patient has heart disease would be welcome because it reduces the time spent testing every patient who might have heart disease; the hospital would only need to test the patients predicted to have the disease [25]. This works effectively when the model is highly accurate, for instance the logistic regression model created in Python, which produced 100 % accuracy. The Python environment has therefore proved efficient enough to be used to create an advanced analytics solution. Healthcare providers should incorporate advanced analytics techniques such as machine learning to classify patients with heart disease effectively; this will help them save time, effort, and the resources at hand. Advanced analytics has proved useful not only in the field of health but also in other fields such as aviation, manufacturing, and education.


References

[1] L. Jun, J. Bioucas-Dias, and A. Plaza. “Semisupervised hyperspectral image classification using soft sparse multinomial logistic regression.” IEEE Geoscience and Remote Sensing Letters 10.2 (2012): 318-322.

[2] L. Dun, L. Tianrui, and L. Decui. “Incorporating logistic regression to decision-theoretic rough sets for classifications.” International Journal of Approximate Reasoning 55.1 (2014): 197-210.

[3] D. Iswar, et al. “Landslide susceptibility assessment using logistic regression and its comparison with a rock mass classification system, along a road section in the northern Himalayas (India).” Geomorphology 114.4 (2010): 627-637.

[4] L.Jing, et al. “Validation and simplification of the Radiation Therapy Oncology Group recursive partitioning analysis classification for glioblastoma.” International Journal of Radiation Oncology* Biology* Physics 81.3 (2011): 623-630.

[5] P. Nicolas, et al. “Semi-supervised knowledge transfer for deep learning from private training data.” arXiv preprint arXiv:1610.05755 (2016).

[6] Hartmann, Alfred C. “Filtering training data for machine learning.” U.S. Patent No. 7,690,037. 30 Mar. 2010.

[7] M. Pennacchiotti, Marco, and P. Ana-Maria. “A machine learning approach to twitter user classification.” Fifth international AAAI conference on weblogs and social media. 2011.

[8] S. Suthaharan. “Big data classification: Problems and challenges in network intrusion prediction with machine learning.” ACM SIGMETRICS Performance Evaluation Review 41.4 (2014): 70-73.

[9] B. Lantz. Machine learning with R: expert techniques for predictive modeling. Packt Publishing Ltd, 2019.

[10] F. Pedregosa, V. Gaël et al. “Scikit-learn: Machine learning in Python.” Journal of machine learning research 12, no. Oct (2011): 2825-2830.

[11] K. Konstantina et al. “Machine learning applications in cancer prognosis and prediction.” Computational and structural biotechnology journal 13 (2015): 8-17.

[12] V. Sofia, R. Brian, R. Anca , and E. Van Der Knaap. “Confusion Matrix-based Feature Selection.” MAICS 710 (2011): 120-127.

[13] M. Nadav David, L. Rokach, and A. Shmilovici. “Using the confusion matrix for improving ensemble classifiers.” In 2010 IEEE 26-th Convention of Electrical and Electronics Engineers in Israel, pp. 000555-000559. IEEE, 2010.

[14] J. Maroco et al. “Data mining methods in the prediction of Dementia: A real-data comparison of the accuracy, sensitivity and specificity of linear discriminant analysis, logistic regression, neural networks, support vector machines, classification trees and random forests.” BMC research notes 4, no. 1 (2011): 299.

[16] Y. Mitchell et al. “Unsupervised machine-learning method for improving the performance of ambulatory fall-detection systems.” Biomedical engineering online 11, no. 1 (2012): 9.

[17] A. Hiba et al. “Using machine learning algorithms for breast cancer risk prediction and diagnosis.” Procedia Computer Science 83 (2016): 1064-1069.

[18] Osborne, J.W., 2017. Simple Linear Models With Categorical Dependent Variables: Binary Logistic Regression.

[19] Dou, Jie, et al. “TXT-tool 1.081-6.1 A comparative study of the binary logistic regression (BLR) and artificial neural network (ANN) models for GIS-based spatial predicting landslides at a regional scale.” Landslide dynamics: ISDR-ICL landslide interactive teaching tools. Springer, Cham, 2018. 139-151.

[20] Wayant, C., Scott, J. and Vassar, M., 2019. Lowering the P Value Threshold—Reply. Jama, 321(15), pp.1533-1533.

[21] Lee, D.K., 2016. Alternatives to P value: confidence interval and effect size. Korean journal of anesthesiology, 69(6), p.555.

[22] VanderWeele, T.J. and Ding, P., 2017. Sensitivity analysis in observational research: introducing the E-value. Annals of internal medicine, 167(4), pp.268-274.

[23] Kyriacou, D.N., 2016. The enduring evolution of the p value. Jama, 315(11), pp.1113-1115.

[24] Jeon, M. and De Boeck, P., 2017. Decision qualities of Bayes factor and p value-based hypothesis testing. Psychological Methods, 22(2), p.340.

[25] Ioannidis, J.P., 2018. The proposal to lower P value thresholds to. 005. Jama, 319(14), pp.1429-1430.