Data Mining with R: An Applied Study

Purpose – The aim of this study is to analyze different classification algorithms with R programming and to determine their accuracy rates. It also encourages the use of R by giving readers the opportunity to experiment.


INTRODUCTION
One of the biggest problems arising from today's developing technology is the volume of stored information. Advances in data storage methods cause the stored information to grow exponentially, and the fact that the Internet is an integral part of daily life accelerates this growth. E-mails, web page records, training notes, market sales data, hospital records and results, bank transactions, social media accounts and sports competition results are produced every second. Almost all of this data is recorded electronically. The question of how, when and for what purpose this stored information will be used falls within the scope of data mining. Data mining is the process of extracting useful information from a large body of data through analysis. The aim of this analysis is to reveal meaningful information, relationships and patterns in the data. Data mining also allows new predictions to be made from historical data. The data mining methods used for these purposes are grouped under three main headings: classification, clustering and association analysis.
This study discusses the classification method, which is frequently encountered in the literature. Classification is one of the most current methods and often offers more practical and faster solutions than many alternatives. In classification, a model is built with the help of various algorithms based on common features, differences, ratings or groupings within the data. Algorithms based on many different theories have been developed to construct such models, and the theoretical structure of each differs mathematically. Broadly, the statistical foundations of these algorithms are decision trees, regression analysis, logistic functions and their extensions, Bayes' theorem and neural networks.
There are several criteria for measuring the validity of a classification model. Among them, accuracy, precision, sensitivity and error rate are the most popular. These criteria are calculated using the equations explained in the second section and are based on statistical quantities such as the correct estimation rate, the correct classification rate and the wrong classification rate. For this reason, they are the most commonly used criteria in the literature. In addition, the success of the model is explained by the numbers of correctly and incorrectly classified observations. For this purpose, the information obtained from the test is presented in a confusion matrix. Table 1 shows an example confusion matrix. In this table, 16 observations of class "a" were correctly estimated and 3 of those appearing in class "b" were incorrectly estimated. The performance criteria can be calculated easily from the confusion matrix. For example, the accuracy rate is the ratio of correctly classified observations to all observations in the table.
Data mining analyses are generally performed with the help of computer programs. Many programs have been developed for classification analysis. The most widely used include Weka, SPSS, Knime, R and Oracle, some of which are open source and some commercial. In recent years, the growing use of software in academic circles has increased the use of these programs, including R, which is frequently preferred in the field of statistics. The R programming language is user-friendly and offers advantages in many areas. R, which has an important place in data mining, is used in this study for the analysis of classification algorithms.
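The accuracy calculation from a confusion matrix can be sketched in base R as follows. This is only an illustration: the 16 correctly classified "a" observations and the 3 misclassified "b" observations are taken from the description of Table 1 (read here as 3 class "b" observations predicted as "a"), while the other two counts are invented so that the matrix is complete.

```r
# Hypothetical 2x2 confusion matrix: rows are actual classes, columns are predictions.
# Only the 16 and the 3 come from the text; the 2 and the 14 are assumed values.
cm <- matrix(c(16,  2,
                3, 14),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("a", "b"), predicted = c("a", "b")))

# Accuracy: correctly classified observations (the diagonal) over all observations
accuracy <- sum(diag(cm)) / sum(cm)

# Error rate: the share of misclassified observations
error_rate <- 1 - accuracy

print(cm)
print(round(c(accuracy = accuracy, error_rate = error_rate), 4))
```

The same two lines work for a confusion matrix with any number of classes, because the correctly classified observations always lie on the diagonal.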
As mentioned, many algorithms and programs have been developed for classification purposes. Analyzing all algorithms at once is not practical. It is also known that some algorithms were developed as the basis for later ones. Therefore, it is more meaningful to consider algorithms that are newer and rest on more robust mathematical foundations. Nevertheless, it is clear that there are many algorithms to be examined. In this study, the J48 and Random Forest algorithms, which are based on tree structures, and the Naïve Bayes algorithm, which is based on probability, are discussed. The study follows three basic steps in order to investigate how the R program can be used to classify data: 1. examining the theoretical information about the algorithms to be discussed in the study, 2. conducting classification analysis with the R program, and 3. interpreting the results and evaluating the contribution of the study.
The first step is the material and method stage. At this stage, the structure of the algorithms and the model performance criteria are examined. The data sets used are also introduced at this stage. The second step involves the analysis of three different data sets (liver, lenses, wine) in the R program with the three classification algorithms mentioned above. The procedures are explained step by step. The program outputs are given as they are so that the results of the analysis can be seen clearly. Also, in this step, all R codes are presented to readers, who are thus given the opportunity to experiment. In the third and last step of the study, the algorithm results are compiled in tables. The compiled results are interpreted to explain the contribution to the literature, and the study is completed by presenting suggestions.

LITERATURE REVIEW
When the literature is reviewed, it is seen that data mining emerged conceptually in the 1960s, when computers began to be used to solve data analysis problems (Han, Kamber, & Pei, 2012). Data mining, initially called data scanning, acquired its present name through the work of computer engineers. In the 1990s, traditional statistical methods were abandoned, and data analyses were carried out with the help of computer modules. However, these modules were very difficult to use and required significant data preparation (He, 2009). This led researchers, and especially computer engineers, to develop new modules. Among the tools that can be used for data mining analysis today, some are developed commercially and some are offered as open source. Examples include SPSS, MATLAB, Oracle, Weka, R, Knime and RapidMiner (Kaya & Özel, 2014). Many studies conducted under the name of data mining are available in the literature. The rapid development of computer technologies and the ease of data acquisition and storage increase the importance of data mining and thus push researchers to work on this subject. Alfaro et al. (2013) conducted a study with adabag, a classification package for R. They showed applications on three data sets from the literature and discussed the similarities and differences of three different algorithms. Zhang (2016) conducted a classification study with Naïve Bayes. That study clearly explained how, and with which packages, classification is carried out in R. Kızılkaya and Oğuzlar (2018) compared the performance of logistic regression and decision tree supervised learning algorithms using the R language. They stated that logistic regression yielded the most successful result according to the sensitivity criterion. Goswami et al. (2018), in their review of the application of data mining techniques, found that there are not enough resources for natural disaster detection, especially in the Indian region. Their study reveals the necessity of data mining in combating natural disasters. Çınar (2019) determined the performance of the C5.0 and Gini classification algorithms in determining students' learning levels using the R language. The C5.0 algorithm showed better results.
This study promotes data mining with R and aims to analyze different classification algorithms with R programming and determine their accuracy rates. The current literature contains many studies on data mining and its applications. However, to the best of our knowledge, few studies clearly show how classification is carried out with R, which has made many kinds of analysis possible, especially in recent years. In this study, the analysis steps are given in addition to what the existing studies provide. It is thus thought that young researchers in particular will acquire habits such as experimenting, self-learning and reading results.

J48 Algorithm
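J48, one of the best-known decision tree algorithms, is the Weka implementation of the C4.5 algorithm, which is itself an extension of the earlier ID3 algorithm. In this algorithm, the entropy and information gain values for the target class are calculated using Equations 1 to 3. The expected information needed to classify a tuple in D is given by (Han et al., 2012):

$$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i) \qquad (1)$$

where $p_i$ is the probability that a tuple in D belongs to class $C_i$ and $m$ is the number of classes. A kind of normalization is applied to the information gain by using the "split information" value, which is defined analogously to entropy:

$$\mathrm{SplitInfo}_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\!\left(\frac{|D_j|}{|D|}\right) \qquad (2)$$

This value represents the potential information generated by dividing D into v partitions, corresponding to the v outcomes of a test on attribute A. For each outcome, the number of tuples having that outcome is taken into account. The gain ratio in this case is:

$$\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}_A(D)} \qquad (3)$$

where $\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)$ and $\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\mathrm{Info}(D_j)$ is the expected information still required after partitioning D on A.
Entropy measures the uncertainty of the class distribution (Bhargava et al., 2013). The information gain of each attribute is calculated from the entropy value. In the J48 algorithm, the decision tree is built starting from the variable that gives the highest information gain. Processing is complete when all variables are included in the tree (Patil & Sherekar, 2013).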

Algorithm Steps:
1. Entropy and the related information gain values are calculated.
2. Features that give the best information gain are added to the decision tree. The best feature creates the base node.
3. After all features have been calculated and the decision tree has branched, the model construction is complete (Kaur & Chhabra, 2014).
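The entropy and gain ratio calculations behind these steps can be sketched in base R as follows. This is a minimal sketch for a single categorical attribute, not the J48 implementation itself. The function names `entropy` and `gain_ratio` and the small weather table are illustrative assumptions and do not come from the study.

```r
# Toy data (assumed for illustration): one categorical attribute and a class label
outlook <- c("sunny", "sunny", "overcast", "rain", "rain", "rain", "overcast",
             "sunny", "sunny", "rain", "sunny", "overcast", "overcast", "rain")
play    <- c("no", "no", "yes", "yes", "yes", "no", "yes",
             "no", "yes", "yes", "yes", "yes", "yes", "no")

# Entropy of a class vector: Info(D) = -sum(p_i * log2(p_i))
entropy <- function(y) {
  p <- as.numeric(table(y)) / length(y)
  p <- p[p > 0]                       # skip empty classes so log2(0) never occurs
  -sum(p * log2(p))
}

# Gain ratio of a categorical attribute x with respect to the class y
gain_ratio <- function(x, y) {
  x <- droplevels(factor(x))
  w <- as.numeric(table(x)) / length(x)                  # |D_j| / |D|
  info_a <- sum(w * as.numeric(tapply(y, x, entropy)))   # information still needed after the split
  gain <- entropy(y) - info_a                            # information gain
  split_info <- -sum(w * log2(w))                        # split information
  gain / split_info
}

info_d <- entropy(play)
gr_outlook <- gain_ratio(outlook, play)
print(round(c(entropy = info_d, gain_ratio = gr_outlook), 3))
```

For this toy table the entropy is about 0.940 and the gain ratio of `outlook` is about 0.156. A tree builder would compute the gain ratio of every candidate attribute and split on the largest one.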

Random Forest Algorithms
Random Forest is one of the most preferred decision tree algorithms for classification problems (Eraldemir, Arslan, & Esen, 2017). Multiple decision trees are generated for the classification process and are then combined into a forest. Because the prediction is based on a large number of decision trees, classification success tends to be high.

Algorithm Steps:
1. The feature that provides the best classification is selected and the starting node is created.
2. A training set is formed from part of the data set. The remaining data form the test set.
3. Trees are built according to the number of variables to be used at each node and the number of trees N. Variables are selected randomly at each node.
4. When N trees have been produced, the model is complete and the class of a new observation is estimated (Akar & Güngör, 2012).
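The steps above map onto the `randomForest` package roughly as in the following sketch. It is not the code used in the study. The built-in `iris` data stand in for the study's data sets, and the seed, `ntree = 500` and `mtry = 2` are arbitrary illustrative choices. The code runs only if the `randomForest` package is installed.

```r
# Sketch only: iris is a stand-in data set, not one of the study's data sets.
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(42)                                     # arbitrary seed, for reproducibility
  n <- nrow(iris)
  train_idx <- sample(n, size = round(0.7 * n))    # part of the data forms the training set
  train <- iris[train_idx, ]
  test  <- iris[-train_idx, ]                      # the remaining data form the test set

  rf <- randomForest::randomForest(Species ~ ., data = train,
                                   ntree = 500,    # N, the number of trees
                                   mtry  = 2)      # variables tried at random at each node

  pred <- predict(rf, newdata = test)              # class estimates for new observations
  rf_accuracy <- mean(pred == test$Species)
  print(rf_accuracy)
}
```

Increasing `ntree` generally stabilises the result at the cost of run time, while `mtry` controls how different the individual trees are from one another.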

Naïve Bayes Algorithms
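Naïve Bayes is a probabilistic method based on Bayes' Theorem, which is named after the mathematician Thomas Bayes. In this method, probability values are calculated from the observed attribute values and classification is made accordingly. If a value cannot be calculated or is not observed, its probability is set to 0. Bayes' Theorem is expressed by Equation 4 (Odabaş, 2017):

$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)} \qquad (4)$$

where H is the hypothesis that an observation belongs to a given class and X denotes the observed attribute values.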
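The calculation can be sketched by hand in base R as follows. This is a minimal sketch for categorical attributes, not the implementation in the `e1071` package. The function name `naive_bayes_posterior` and the toy weather table are illustrative assumptions. The attributes are treated as independent within each class, which is the "naïve" assumption.

```r
# Toy data (assumed for illustration, not from the study)
weather <- data.frame(
  outlook = c("sunny", "sunny", "overcast", "rain", "rain", "rain", "overcast",
              "sunny", "sunny", "rain", "sunny", "overcast", "overcast", "rain"),
  windy   = c("no", "yes", "no", "no", "no", "yes", "yes",
              "no", "no", "no", "yes", "yes", "no", "yes"),
  play    = c("no", "no", "yes", "yes", "yes", "no", "yes",
              "no", "yes", "yes", "yes", "yes", "yes", "no"),
  stringsAsFactors = FALSE
)

# Posterior class probabilities for one new observation (a named list of attribute values)
naive_bayes_posterior <- function(data, target, new_obs) {
  classes <- unique(data[[target]])
  scores <- sapply(classes, function(cl) {
    rows  <- data[data[[target]] == cl, ]
    prior <- nrow(rows) / nrow(data)                       # P(H)
    # P(X | H): product of per-attribute conditional frequencies
    likelihood <- prod(sapply(names(new_obs),
                              function(a) mean(rows[[a]] == new_obs[[a]])))
    prior * likelihood
  })
  scores / sum(scores)                                     # dividing by P(X) normalises the scores
}

post <- naive_bayes_posterior(weather, "play", list(outlook = "sunny", windy = "yes"))
print(round(post, 3))
```

An attribute value that never occurs with a class, such as `outlook = "overcast"` with `play = "no"` in this table, makes the whole product for that class equal to 0. This is the zero-probability behaviour described above.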

Model Performance Evaluation
There are many criteria used to evaluate model performance. Accuracy, error rate, precision and sensitivity are the most important ones. In this study, accuracy values were taken into consideration. The accuracy value is the ratio of the number of correctly classified observations to the total number of observations. This criterion reflects the capability of the classifier. In other words, a high and acceptable accuracy value shows that the model is applicable to the classification of new observations, so that the classifier will be a good predictor for them (Han et al., 2012).
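For reference, the four criteria named above can be written for the two-class case as follows. These are the standard textbook definitions, not reproduced from the study. The symbols TP, TN, FP and FN (true positives, true negatives, false positives and false negatives) are notation introduced here for illustration.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Error\ rate} = \frac{FP + FN}{TP + TN + FP + FN} = 1 - \mathrm{Accuracy}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$$

With more than two classes, accuracy is still the sum of the diagonal of the confusion matrix divided by the total number of observations, while precision and sensitivity are computed separately for each class.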

Data Sets
In this study, different data sets suitable for classification were obtained from the UC Irvine Machine Learning Repository (2019). The data sets were selected so that they have different attribute types and contain no missing data. In this way, it was expected that no unpredictable values would remain and that the models would be built on a sounder basis. Information on these data sets is given in Table 2.

APPLICATION WITH R
Before proceeding to the data mining stage with R, the packages given in Figure 1 must be installed in R. The other packages they depend on are installed automatically with them.

install.packages("plyr")
install.packages("caret")
install.packages("RWeka")
install.packages("partykit")
install.packages("randomForest")
install.packages("e1071")
Figure 1. R Packages Required for Classification Analysis

The data sets and the R program were prepared for the study. The data sets were then evaluated with the help of classification algorithms. The application consists of three steps. The first step is to prepare the data and transfer it to the program. The second step is to analyze the data with classification algorithms, and the third step is to evaluate and compare the results. The classification analysis steps for the "wine" data set are described in detail in this section. The other data sets were analyzed with the same codes. The results are presented for discussion.
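The data preparation step can be sketched in base R as follows. The helper name `split_train_test` is an illustrative assumption, the built-in `iris` data stand in for the study's data sets, and the default training share of 0.7 mirrors the 70/30 split reported later in the study. The study's own code in Figure 3 may differ.

```r
# Split a data frame at random into a training part and a test part.
# `p` is the training share; `seed` is an arbitrary value that makes the split reproducible.
split_train_test <- function(data, p = 0.7, seed = 1) {
  set.seed(seed)
  idx <- sample(nrow(data), size = round(p * nrow(data)))
  list(train = data[idx, , drop = FALSE],
       test  = data[-idx, , drop = FALSE])
}

parts <- split_train_test(iris)   # iris is a stand-in data set
print(c(train = nrow(parts$train), test = nrow(parts$test)))
```

This is a simple random split. The `caret` package listed in Figure 1 also provides `createDataPartition`, which keeps the class proportions similar in both parts. Whether the study used it is not stated here.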

Classification with J48 Algorithm
In this section, the analysis is carried out with J48, one of the decision tree algorithms. The analysis steps for the training set created in Figure 3 are given in Figure 4. According to the resulting tree, the flava feature gives the highest information gain in tree formation and is therefore chosen as the first split. According to the results, the tree has 5 leaves and a size of 9. When the confusion matrix is examined, there are two misclassified observations. The correct classification rate of the algorithm is 98.41%, the Kappa statistic is 0.97 and the root mean squared error is 0.10. Figure 5 shows the decision tree structure for the J48 algorithm.
The results for the test data set are given in Figure 6. When the confusion matrix of the test data set is examined, it is seen that no observation is misclassified.
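A J48 analysis of this kind can be sketched with the `RWeka` package as follows. It is not the code from Figures 3 to 6. The built-in `iris` data stand in for the "wine" data set and the seed is arbitrary. The code runs only if `RWeka` and its Java dependency are available.

```r
# Sketch only: iris is a stand-in data set, not the study's wine data.
if (requireNamespace("RWeka", quietly = TRUE)) {
  set.seed(1)                                            # arbitrary seed
  idx   <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
  train <- iris[idx, ]
  test  <- iris[-idx, ]

  j48_model <- RWeka::J48(Species ~ ., data = train)     # fit the tree on the training set
  print(j48_model)                                       # tree structure, number of leaves, tree size
  print(summary(j48_model))                              # training-set accuracy, kappa, confusion matrix

  pred <- predict(j48_model, newdata = test)             # classify the test set
  j48_confusion <- table(actual = test$Species, predicted = pred)
  j48_accuracy  <- sum(diag(j48_confusion)) / sum(j48_confusion)
  print(j48_confusion)
  print(j48_accuracy)
}
```

The training-set summary corresponds to the kind of output reported for Figure 4, and the test-set confusion matrix to the kind reported for Figure 6.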

Classification with Random Forest Algorithm
The classification results for this multi-tree algorithm are given in Figure 7. In the model created with the training set, four observations were misclassified. When the test set results are examined, it is seen that the algorithm achieves an accuracy rate of 1 (that is, 100%) for this data set.

RESULTS
In this study, the stages of applying data mining classification algorithms in R are shown on the "wine" data set, which is frequently used in the literature. The main purpose of the study is to encourage readers to carry out analyses with R and to present the application of basic classification algorithms. For this purpose, three algorithms commonly used in the literature were selected: J48, Random Forest and Naïve Bayes. Two points were considered when selecting the algorithms. The first is that their structures are based on different mathematical foundations. The second is how frequently researchers choose them. Throughout the study, the analysis of the "wine" data was explained to the reader. In the background of the study, however, the "liver" and "lenses" data sets were also analyzed. These data sets were selected because they contain no missing data and have different attribute types.
After the data sets were ready for analysis, the application phase of the study began. First, 70% of each data set was assigned to the training set and 30% to the test set (Figure 3). Then the actual analysis was carried out with the J48, Random Forest and Naïve Bayes algorithms, and the data sets were classified with their help. The accuracy criterion was chosen to interpret the results because, as mentioned before, it shows how accurately new observations are predicted.
In the last step of the study, the results were examined and interpreted. The accuracy rates for the three data sets are given in Table 3. For the "wine" data, all three algorithms perform very well. The results for the other two data sets ("lenses" and "liver") are similar. Only the "liver" data set gave a lower accuracy than expected with the Naïve Bayes algorithm (0.55).

DISCUSSION, CONCLUSIONS AND RECOMMENDATIONS
Classification analysis is of great importance in both statistical and interdisciplinary analysis, for reasons such as discovering connections in a data set, identifying relationships and patterns between features, and getting to know the data set well enough for meaningful analysis. For this reason, many researchers have carried out studies in this field. When these studies were examined, the number of data mining analyses carried out with R appeared to be low. For this reason, this study compared the performance of algorithms within the scope of data mining with R. The accuracy rate was taken as the criterion. All codes are given with their outputs in order to serve as an example, especially for young researchers and students.
It is thought that this study can serve as a source for other researchers, will encourage the use of R, and will lead researchers and students to produce new work by trying the codes. In subsequent studies, a similar analysis can be carried out by developing the given codes, or classification analysis in R with different algorithms can be examined.