Weka
1 Introduction
The main objective of this project is to analyse the provided data file using data mining tools. The project is divided into five tasks: data acquisition, data pre-processing, mining tool preparation, clustering analysis and visualization. In data acquisition, the user downloads the project data file, Ebola Discussion. In data pre-processing, the user extracts the relevant substring from each field. This step prepares the data for mining analysis and improves performance through tokenization, stemming, named entity recognition and stop word removal. It also imputes the missing values in each field. In mining tool preparation, the user downloads and installs Weka, opens the Explorer, loads the provided data file, and finally removes the attributes or fields that are not meaningful for pattern analysis. In clustering analysis, the user clusters the provided data file. Finally, the user visualizes the provided data file. Each task is discussed and analysed in detail in the following sections.
2 Data Acquisition
In data acquisition, the user downloads the project data file, Ebola Discussion. The provided data file is illustrated below (Han, Kamber & Pei, 2012).
3 Data Pre-processing
In data pre-processing, the user extracts the relevant substring from each field. This step prepares the data for mining analysis and improves performance through tokenization, stemming, named entity recognition and stop word removal. It also imputes the missing values in each field. The provided data file successfully completed the data pre-processing step (Hancock, 2012). The result is illustrated below.
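The pre-processing steps named in this section can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the procedure used in the project: the stop-word list, the crude suffix-stripping `stem` helper and the sample inputs are all hypothetical, and named entity recognition is left out because it needs an external model.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "in", "that", "of", "to", "we", "and"}

def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    """Very crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, drop stop words, then stem what is left."""
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]

def impute_mode(values):
    """Replace missing entries (None or empty string) with the most common value."""
    present = [v for v in values if v not in (None, "")]
    mode = Counter(present).most_common(1)[0][0]
    return [mode if v in (None, "") else v for v in values]

# Hypothetical sample inputs, not taken from the project data file.
print(preprocess("The President says we conquered Ebola"))
print(impute_mode(["en", None, "en", "fr", ""]))
```

Imputing with the most common value mirrors the "mean/mode" replacement that Weka reports in its k-means output; numeric fields would use the mean instead.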
4 Mining Tool Preparation
Weka is data mining software that provides a collection of machine learning algorithms for data mining tasks. It includes tools for:
- Regression
- Clustering
- Association
- Data pre-processing
- Classification
- Visualisation
Here, the user needs to download and install Weka. After installation, open the Explorer and then open the provided data file. It is illustrated below (Mitsa, 2010).
Finally, remove the attributes or fields that are not meaningful for pattern analysis, using the following steps. Choose Filter and apply StringToWordVector, which transforms the MESSAGE string attribute into a vector of words. It is illustrated below.
The result of removing the attributes or fields is shown below (Spendler, 2010).
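Conceptually, a string-to-word-vector transformation builds a vocabulary from all messages and then represents each message as one number per vocabulary word. The Python sketch below illustrates that idea with simple word counts; it is not Weka's implementation, the function names are hypothetical, and the two sample messages are invented for the example.

```python
def build_vocabulary(messages):
    """Collect the distinct lower-cased words of all messages, sorted."""
    return sorted({word for msg in messages for word in msg.lower().split()})

def to_word_vector(message, vocabulary):
    """Count how often each vocabulary word occurs in one message."""
    words = message.lower().split()
    return [words.count(term) for term in vocabulary]

# Hypothetical messages, not taken from the project data file.
messages = ["Ebola outbreak news", "Ebola vaccine news news"]
vocabulary = build_vocabulary(messages)
vectors = [to_word_vector(m, vocabulary) for m in messages]

print(vocabulary)
print(vectors)
```

Each message becomes a numeric row with one column per word, which is the form a distance-based clustering algorithm needs.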
5 Clustering Analysis
Cluster analysis is used to identify groups of similar instances within the provided data file, Ebola Discussion. In Weka, the cluster mode can be set to use the training set, a supplied test set, a percentage split, or classes-to-clusters evaluation. There is also an option to ignore selected attributes of the provided data file, depending on the requirements. The available clustering schemes include farthest first, X-means, EM, k-means and Cobweb. Here, we use k-means to analyse the Ebola Discussion data file. In general, clustering lets a user group the data in order to find patterns in the given data file. One defining benefit of clustering compared with classification is that every attribute is used to analyse the provided data, rather than one attribute being singled out as the class (Stahlbock, Abou-Nasr & Weiss, 2018).
In clustering analysis, the user clusters the provided data file using the following steps.
First, click the Cluster tab and choose the SimpleKMeans clustering algorithm. The output is illustrated below.
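To make the k-means idea concrete, here is a minimal Python sketch of the standard assign-and-update loop (Lloyd's algorithm) with squared Euclidean distance. It is an illustration under stated assumptions rather than Weka's code: the function names and the four two-dimensional points are hypothetical, and the starting centroids are passed in explicitly, whereas Weka picks its starting points at random.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_vector(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(coords) / len(points) for coords in zip(*points)]

def k_means(points, initial_centroids, max_iterations=100):
    """Alternate assignment and centroid update until assignments stop changing."""
    centroids = [list(c) for c in initial_centroids]  # do not mutate the input
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged
        assignments = new_assignments
        # Update step: each centroid becomes the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_vector(members)
    return assignments, centroids

# Hypothetical toy data: two obvious groups in two dimensions.
points = [[1.0, 1.0], [1.0, 2.0], [9.0, 9.0], [10.0, 8.0]]
assignments, centroids = k_means(points, [points[0], points[2]])
print(assignments)
print(centroids)
```

On text data, the points would be the word vectors produced during tool preparation, so the result depends heavily on how the messages were vectorized.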
K Means
======
Number of iterations: 2
Within cluster sum of squared errors: 60131.00000000001
Initial starting points (random):
Cluster 0: 'A professor in U S is telling Liberians that the Defense Department manufactured Ebola _URL_ via','Mon Sep 29 13:51:10 +0000 2014'
Cluster 1: 'Goodluck Jonathan We Conquered Ebola We ll Crush Boko Haram President says President Goodluck Jonathan sai _URL_','Mon Sep 29 12:35:57 +0000 2014'
Missing values globally replaced with mean/mode
Final cluster centroids:
Time taken to build model (full training data) : 0.07 seconds
=== Model and evaluation on training set ===
Clustered Instances
0 30434 (100%)
1 4 (0%)
The visualization of the k-means result is illustrated below.
The k-means result window displays the centroid of each cluster, together with the number and percentage of instances assigned to each cluster. Cluster centroids are the mean vectors of each cluster, so they can be used to characterize the clusters. The parameters of the clustering algorithm can be adjusted by clicking SimpleKMeans. The output of the k-means algorithm shows two clusters, cluster 0 and cluster 1. The randomly chosen starting point for cluster 0 is the message "A professor in U S is telling Liberians that the Defense Department manufactured Ebola _URL_ via", and the starting point for cluster 1 is the message "Goodluck Jonathan We Conquered Ebola We ll Crush Boko Haram President says President Goodluck Jonathan sai _URL_". Each cluster represents a type of behaviour in the provided data file, although here almost all instances (30434 of 30438) fall into cluster 0. The evaluation on the training set gives the following results (Veart, 2013).
Cluster | Clustered Instances |
0 | 30434 (100%) |
1 | 4 (0%) |
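The two statistics reported in this section can be reproduced by hand. The Python sketch below first recomputes the cluster shares from the instance counts in the table, showing that the "100%" and "0%" figures are rounded values, and then computes a within-cluster sum of squared errors on a small example. The function names and the toy points are hypothetical, and the toy error value is not comparable to the 60131 reported by Weka, which depends on how Weka encodes and scales the attributes.

```python
def cluster_shares(counts):
    """Percentage of all instances that falls into each cluster."""
    total = sum(counts)
    return [100.0 * c / total for c in counts]

def within_cluster_sse(points, assignments, centroids):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, centroids[cluster]))
        for p, cluster in zip(points, assignments)
    )

# Instance counts taken from the table in this section.
counts = [30434, 4]
shares = cluster_shares(counts)
print([round(s, 2) for s in shares])  # unrounded shares, to two decimals
print([round(s) for s in shares])     # rounded to whole percent

# Hypothetical toy data with its cluster assignments and mean-vector centroids.
points = [[1.0, 1.0], [1.0, 2.0], [9.0, 9.0], [10.0, 8.0]]
assignments = [0, 0, 1, 1]
centroids = [[1.0, 1.5], [9.5, 8.5]]
sse = within_cluster_sse(points, assignments, centroids)
print(sse)
```

A lower within-cluster sum of squared errors means the points sit closer to their centroids, but it says nothing about whether the cluster sizes are balanced.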
6 Visualization
The visualization of the provided data file is illustrated below.
7 Conclusion
This project successfully analysed the provided data file using data mining tools. The project was divided into five tasks: data acquisition, data pre-processing, mining tool preparation, clustering analysis and visualization. In data acquisition, the project data file, Ebola Discussion, was downloaded. In data pre-processing, the relevant substring was extracted from each field. This step prepared the data for mining analysis and improved performance through tokenization, stemming, named entity recognition and stop word removal. The missing values in each field were also imputed. In mining tool preparation, Weka was downloaded and installed, the provided data file was opened in the Explorer, and the attributes or fields that were not meaningful for pattern analysis were removed. In clustering analysis, the provided data file was clustered using k-means. Finally, a visualization of the provided data file was produced. Each of these tasks has been discussed and analysed in detail.
References
Han, J., Kamber, M., & Pei, J. (2012). Data mining. Waltham: Morgan Kaufmann.
Hancock, M. (2012). Practical data mining. Boca Raton, FL: CRC Press.
Mitsa, T. (2010). Temporal Data Mining. Hoboken: CRC Press.
Spendler, L. (2010). Data mining and management. New York: Nova Science Publishers.
Stahlbock, R., Abou-Nasr, M., & Weiss, G. (2018). Data Mining. Bloomfield: C. S. R. E. A.
Veart, D. (2013). First, Catch Your Weka. Auckland: Auckland University Press.