Monitoring Class Activity and Predicting Student Performance Using Moodle Action Log Data

Purpose – This paper proposes a novel approach for processing course log data obtained from Moodle-based blended courses in order to visualize patterns of student activity within the online environment and to determine whether these log data can be used to predict student academic performance. Method – Logs of student activities were summarized and processed using the Vector Space Model approach. This resulted in a novel vector-based form of representation which can be used to map students’ activity in a latent activity space given a set of activity dimensions. An enriched form of this representation was also generated by processing the DateTime and IP address metadata for the purpose of developing a classification/predictive model of students’ performance.


INTRODUCTION
Blended Learning (BL) has become popular in the last few years, spurred by the widespread use of the web and the opportunities and conveniences that it provides (Norberg, Dziuban, & Moskal, 2011). Wikipedia defines BL as "an education program (formal or non-formal) that combines online digital media with traditional classroom methods. It requires the physical presence of both teacher and student, with some element of student control over time, place, path, or pace" (Wikipedia, 2017). By combining face-to-face with online learning techniques, BL is considered capable of accommodating the various learning strategies and styles of students. This perceived benefit convinced many Higher Education Institutions (HEIs) in the Philippines to begin adopting a blended learning strategy in their curricula. By providing students with the ability to independently access learning resources "anytime, anywhere", BL is assumed to create an advantageous environment that can remove many barriers to enhancing student performance, while promoting high-quality interactions between faculty and students (Nicdao, 2013; Kanuka, Brooks, & Saranchuck, 2009).
Learning Management Systems (LMS) are a key feature of blended learning (Dias & Diniz, 2014). Specifically, in the Philippines, the Modular Object-Oriented Developmental Learning Environment or Moodle LMS (Rice, 2006) is commonly used to support the blended learning setup, not only because it is cost effective but also because it provides sufficient features to enable HEIs to create flexible online learning environments. These environments not only allow students convenient access to educational resources; they also give HEIs the opportunity to collect vast quantities of data on students' activities. These data offer rich potential for studying student behavior and can also help determine whether there are patterns that lead to better success in learning. One example is the action log recorded in Moodle, which maintains six data dimensions describing how students interact with the online environment (see Table 1).
Taking advantage of this data, however, is neither simple nor straightforward due to its massive volume and high velocity. More often than not, assistance from specialized tools is needed to extract useful information for tracking and assessing the activities performed by students, especially in cases where the faculty member is handling multiple classes. At the same time, although the Moodle system provides some reporting tools, it does not provide specific features that enable educators to directly monitor and evaluate the activities of students in relation to the structure and contents of the course and how these affect the learning process (Zorrilla, Millan, & Menasalvas, 2005). Based on experience, this keeps instructors from making meaningful sense and use of this data (Estacio & Raga, 2017).
This paper describes a pilot study that discusses and illustrates the use of a novel approach in analyzing the log data generated by Moodle in a blended learning context. The proposed technique can be used to process and break down the multidimensional log data collected by the LMS in order to generate graphical representations that provide a profile of students' activities online, both individually and within a group. This can help to reveal some of the students' adopted self-regulatory learning strategies as reflected by the differences in the way they utilize features of the LMS. The study also attempts to determine whether the same data can be used to predict student course performance through the use of various data mining techniques. Experiments were conducted comparing several classification algorithms using a feature-enhanced version of the same data used in the previous task. The aim of these experiments was to determine whether there is an overall ideal set of data attributes that can be used to predict and anticipate student academic performance in blended courses, and which algorithm is best suited to process these attributes. The study is being conducted in support of a Course Redesign Program (CRP) of a University whose goal is to redesign instructional approaches by integrating e-learning technology into traditional classes to achieve quality enhancements.

THE PROPOSED APPROACH
The proposed approach for addressing the issues cited above combines techniques borrowed from the fields of Information Retrieval (IR) and Data Mining (DM). In particular, the Vector Space Model (VSM) and classification techniques are discussed in the succeeding sections.

IR and Vector Space Model
VSM is a statistical model of representation often used in the field of Information Retrieval for processing text documents (Singhal, 2001). The main idea behind VSM is to construct a vector-of-terms representation for documents and to use these vectors to compare the contents of documents in a latent semantic space. Recently, there has been some progress on utilizing VSM for purposes outside the field of IR. Sreeja and Mahalakshmi (2016), for example, explored the use of VSM to automatically detect emotions in English poems. Fraser and Hirst (2016) investigated using VSM to detect language impairments among people with Alzheimer's disease, and Younge and Kuhn (2016) used VSM as a measure to detect patent similarity. Salehi, Pourzaferani, and Razavi (2013), in an attempt to provide students with a tool to cope with the ever-increasing number of learning materials on the web, also developed a hybrid recommender system that locates suitable learning materials and delivers them to learners based on their specific attributes. In the same manner, in this study, VSM is applied to activity data generated within blended learning courses to determine whether it can enable instructors to overcome the sheer volume of data and use it as a guide in providing formative feedback and/or in adjusting pedagogical strategies.
In traditional VSM, if terms are represented using words, then every word in the document is treated as an independent dimension in the vector representation, and the value assigned to each dimension is the number of occurrences of that unique word in the document. Using this approach, any document can be represented by a vector and thereafter plotted and compared in a multi-dimensional semantic space. To compute document similarity in this space, the angle between the representative vectors can be measured using the cosine formula (see Equation 1). For vectors of non-negative counts, this returns a value between zero and one. The higher the value, the more similar the documents are assumed to be.
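The cosine comparison described above can be sketched in a few lines of Python. This is a minimal illustration that assumes the standard cosine similarity (dot product divided by the product of the vector lengths); the word-count vectors are made up, and returning 0.0 for an all-zero vector is a convention chosen here, not something stated in the paper.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_a * norm_b)

# Hypothetical word-count vectors over a shared three-word vocabulary.
doc_a = [2, 1, 0]
doc_b = [4, 2, 0]  # same word proportions as doc_a
doc_c = [0, 0, 3]  # shares no words with doc_a

sim_ab = cosine_similarity(doc_a, doc_b)  # vectors point the same way
sim_ac = cosine_similarity(doc_a, doc_c)  # vectors are orthogonal
```

Because the score depends only on the angle, doc_a and doc_b are treated as maximally similar even though doc_b contains twice as many words.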

Classification Techniques
Classification is a data mining task used to analyze the attributes of data items in a collection with the end goal of assigning individual items to target categories or classes (Ahmed & Elaraby, 2014). The classification process involves two phases, the initial learning phase and the subsequent classification phase (Baradwaj & Pal, 2011). During the learning phase, a selected set of training data with corresponding class labels is first analyzed by a classification algorithm to develop a classification model. In the succeeding classification phase, sample test data (without their class labels) are fed to the generated classification model to determine how well the model can predict the target category or class label of each item. The percentage of correct class labels output by the model is then used to estimate the accuracy of the model. If the accuracy is acceptable, the model is deemed fit to be applied to new and real-time data sets.
Applying classification techniques to predicting student performance is a challenging task. An initial key problem that needs to be addressed is identifying the most suitable method for predicting the performance (Shahiri & Husain, 2015). Following Kabakchieva (2013), several learning algorithms were used in this experiment. These are general-purpose learning algorithms covering different paradigms. They were selected because of their availability in the Scikit-learn machine learning library of Python (Pedregosa et al., 2011) and were used with default parameter settings:
(1) Logistic regression is a predictive modeling approach used to quantify the degree of relationship between a dependent variable and one or more independent variables.
(2) Linear discriminant analysis (LDA) is a method that can be used to classify a data object into one of several classes by finding the linear combination of features that characterizes the different classes. It is closely related to regression analysis (Xanthopoulos, Pardalos, & Trafalis, 2013).
(3) kNN is an instance-based algorithm used to classify a data object by applying a majority voting mechanism among its nearest neighbors in a feature space.
(4) CART (Classification and Regression Trees) is a predictive model that generates a decision tree.
(5) Random Forest is an ensemble type of learning algorithm that can classify data objects by constructing several decision trees during training time and then combining the predictions of the individual trees as the decision output.
(6) BayesNet is an algorithm that can be used to represent probability distributions using a network of nodes.
(7) SVM is a supervised machine learning technique often used for classification. It operates by finding a maximum-margin hyperplane that can segregate two classes effectively.
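As a rough illustration of the two-phase process and the algorithm list above, the sketch below instantiates Scikit-learn estimators with default parameters and scores each on held-out data. Several things here are assumptions rather than the authors' setup: the data are synthetic, the 70/30 split is arbitrary, random_state is fixed only for reproducibility, and GaussianNB is used as a stand-in because Scikit-learn does not ship a classifier named BayesNet.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 300 "students", 7 numeric attributes, 3 classes.
X, y = make_classification(n_samples=300, n_features=7, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

models = {
    "Logistic regression": LogisticRegression(),
    "LDA": LinearDiscriminantAnalysis(),
    "kNN": KNeighborsClassifier(),
    "CART": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Naive Bayes (stand-in for BayesNet)": GaussianNB(),
    "SVM": SVC(),
}

accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)                   # learning phase
    accuracy[name] = model.score(X_test, y_test)  # classification phase

for name, value in accuracy.items():
    print(f"{name}: {value:.3f}")
```

The score method returns the fraction of test items whose predicted label matches the true label, which corresponds to the accuracy estimate described in the text.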

Analysis model
The proposed analysis model consists of three stages: (1) collection and preprocessing of action logs, (2) application of VSM representation to generate activity space visualization, and (3) training of the classification model and production of course performance predictions (Figure 1). The analysis process focuses first on the action dimension. Table 2 shows the initial set of action types examined in this study. These actions were selected because they represent the various activities that students most often engaged in inside Moodle. The collected records were pre-processed by anonymizing specific student information. Representing class activity using VSM requires the construction of vectors that represent the activity of each student. This activity vector can be defined simply as a list of action types with corresponding values depicting how many times each action was initiated by the student. For instance, Figure 2 provides a sample matrix depicting a set of activity vectors for five students. Here, the value in each element represents the number of times each student performed that action. As such, a value of zero means that the action type was not performed at all.
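The construction of activity vectors can be sketched as a simple counting step. The column order and the assignment of actions to content, engagement, and assessment groups below are hypothetical (the paper's Table 2 defines the actual action types); the action names are taken from attributes mentioned elsewhere in the paper, and the log records are made up.

```python
from collections import Counter

# Hypothetical column order; Table 2 of the paper defines the real action types.
ACTION_TYPES = [
    "resource view", "url view", "course recent",                         # content
    "forum view forum", "forum view discussion", "forum add discussion",  # engagement
    "quiz view", "quiz attempt", "quiz close attempt",                    # assessment
]

def build_activity_vectors(log_records):
    """Map each student ID to a list of per-action counts in ACTION_TYPES order."""
    counts = {}
    for student_id, action in log_records:
        if action in ACTION_TYPES:
            counts.setdefault(student_id, Counter())[action] += 1
    return {sid: [c[a] for a in ACTION_TYPES] for sid, c in counts.items()}

# Made-up, anonymized (student, action) pairs standing in for Moodle log rows.
log_records = [
    ("s1", "resource view"), ("s1", "resource view"), ("s1", "quiz attempt"),
    ("s2", "forum view forum"), ("s2", "forum add discussion"),
]
vectors = build_activity_vectors(log_records)
```

Actions a student never performed simply stay at zero, which matches the interpretation of zero values given above.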

Figure 2. Student Activity Vectors
Following the semantic space analogy, each activity vector can serve as a coordinate that can be used to plot students in a 3-dimensional space where each dimension is notionally assigned to one type of activity. Figure 3 illustrates this space along with student vectors plotted in it.
This representation can be used to compare students' activity with each other and/or to measure how much students implicitly prefer a certain type of activity within the environment (e.g., engagement dimension). In this paper, the latter approach is explored. Notice that in Figure 2, the action types in the activity vector are ordered based on the type of activity to which the action type belongs (e.g., the first 3 columns belong to content access, the next set belongs to forum engagement, and so on). This coding enables a one-hot encoding representation to be constructed for each activity dimension. This can be done by setting the action types for a specific activity to a non-zero value (i.e., one) while the rest of the action type values are set to zero. Thus, the representative vector for each activity dimension would be as shown in Table 3.

Measuring each student's level of activity relative to each dimension then simply requires applying the cosine formula between the student's activity vector and the one-hot encoding representation vector for each dimension. This process yields, for each student, one cosine score per dimension; taken together, these scores represent the level of activity of each class in each dimension.
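The scoring step can be sketched as follows, using the representative vectors listed in Table 3. The student's counts are made up for illustration, and treating an all-zero vector as scoring 0.0 is an assumption made here.

```python
import math

# Representative vectors as listed in Table 3.
DIMENSIONS = {
    "Content":    [1, 1, 1, 0, 0, 0, 0, 0, 0],
    "Engagement": [0, 0, 0, 1, 1, 1, 0, 0, 0],
    "Assessment": [0, 0, 0, 0, 0, 0, 1, 1, 1],
}

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # assumed convention for zero vectors

def activity_profile(activity_vector):
    """Return one cosine score per activity dimension for a single student."""
    return {name: cosine(activity_vector, rep) for name, rep in DIMENSIONS.items()}

student = [6, 3, 0, 1, 0, 0, 2, 2, 0]  # made-up action counts
profile = activity_profile(student)
for name, score in profile.items():
    print(f"{name}: {score:.3f}")
```

The three scores can be used directly as the student's coordinates in the activity space. In this example the content score is the largest, reflecting that most of the student's actions fall in the first three columns.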

Table 3. Representative vectors for each activity dimension
Activity Dimension    Representative Vector
Content               1 1 1 0 0 0 0 0 0
Engagement            0 0 0 1 1 1 0 0 0
Assessment            0 0 0 0 0 0 1 1 1

The final step applies classification techniques to analyze the data further and to test how well these algorithms can predict student course performance. For this purpose, the vector structure is first subjected to a data transformation process which aims to enrich it with additional data elements extracted from the Time and IP Address dimensions of the log data (see Table 1). The main idea of this enrichment process is to include as many attributes of the action log as possible in the vector structure, and then, later on, to test the strength of these attributes as predictors of students' academic performance.
The DateTime stamp was first processed by separating the Date and Time values. The date values were then used to compute the total_days_span (TDS) index. This was done by counting the total number of days elapsed between the first and last dates on which the student logged an action in the system. Then, the total_access_days (TAD) index was measured by counting the total number of unique days on which the student logged an action. The two index values were then combined to produce the access_density_score (ADS). The proposed formula used to define ADS is shown in Equation 2.

ADS = tanh(TDS / TAD)    (2)
The ADS expresses a scaled ratio of TDS and TAD. This scaled ratio is proposed to objectively rank the amount of effort exerted by each student in conducting activities within the online environment. For example, some students may incur the same number of access days, but if one student has a lower number of days spanned in using the system, then the ADS will assign that student a higher score for the effort. Because the hyperbolic tangent is bounded, ADS values lie between -1 and +1; as students exert more effort in accessing the system on a daily basis, the ADS value approaches +1. Comparing the ADS of students serves to highlight the effort profile exerted by students in accessing the system.
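A minimal sketch of the TDS, TAD, and ADS computation is given below. It follows the formula exactly as printed, tanh(TDS/TAD), and assumes that TDS is the difference in days between the last and first access dates; the dates are made up. Under the formula as printed, a larger span for the same number of access days yields a larger score, so if the intended ratio is TAD/TDS, the division would need to be reversed.

```python
import math
from datetime import date

def access_density_score(access_dates):
    """Return (TDS, TAD, ADS) for one student's non-empty list of action dates."""
    unique_days = set(access_dates)
    tad = len(unique_days)                            # total_access_days
    tds = (max(unique_days) - min(unique_days)).days  # total_days_span (assumed definition)
    ads = math.tanh(tds / tad)                        # formula as printed in the paper
    return tds, tad, ads

# Made-up access dates; repeated dates count once toward TAD.
dates = [date(2017, 6, 5), date(2017, 6, 5), date(2017, 6, 7), date(2017, 6, 15)]
tds, tad, ads = access_density_score(dates)
print(tds, tad, round(ads, 4))
```

Since TDS and TAD are never negative, the score produced by this sketch always falls between 0 and 1.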
The time metadata of all the actions incurred by the students, on the other hand, were grouped into four different categories, namely: (i) AM+, (ii) AM-, (iii) PM+, and (iv) PM-. The basis for assigning a particular time stamp to each category is provided in Table 4. This grouping serves to highlight the access time profile of students in accessing the system. For processing the IP address stamp, another grouping based on the known Network ID (NID) of computers located inside the University was used. The NID is the leftmost numeric label in the IP address used to identify computers in a network. The designated NID of computers inside the target University is 168; therefore, any action whose NID is not 168 was initiated using devices outside the university premises. All actions were grouped into those incurred inside the university and those incurred outside it. This grouping serves to highlight the access location profile of students in accessing the system.
Lastly, the final grade achieved by each student in the course was added as the last attribute. To generate a categorical class label for each student, the final grades were classified as High, Low, or Failed.
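The location grouping can be sketched as a check on the first octet of each IP address. The function name and the sample addresses are hypothetical; only the NID value of 168 comes from the text. The time-of-day grouping is not sketched because its category boundaries are defined in Table 4.

```python
from collections import Counter

def access_location(ip_address, campus_nid="168"):
    """Label an action as inside or outside the university from the first IP octet."""
    first_octet = ip_address.strip().split(".")[0]
    return "inside" if first_octet == campus_nid else "outside"

# Made-up IP addresses standing in for the IP address dimension of the log.
ips = ["168.10.4.22", "112.198.70.5", "168.10.9.1"]
location_counts = Counter(access_location(ip) for ip in ips)
print(location_counts)
```

The two resulting counts per student can then be appended to the activity vector as additional attributes.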

Activity visualization
Initial experiments were conducted exploring more than 285,000 action log records generated by 885 students in three different blended courses: Basic Computing (CSC16), Engineering Management (EGR36), and Elementary Statistics (MAT22). Figure 5 (graphs A-C) shows the results of applying VSM to activity vectors generated using this dataset. Each point in these graphs represents a student's cosine score for each activity dimension per class. Although students are anonymously depicted, the graphs clearly indicate the overall degree of activity among the cohorts. The visualizations also indicate that different classes vary widely in how they utilize the tools provided within Moodle.
Within a given class, however, students adopt similar behavior in allotting time and effort between different tools and activities. Students from CSC16, for example, generally log in to Moodle to access lecture materials with little regard for forum engagement. MAT22 students display an equal level of preference for accessing lecture materials and taking assessment activities, with some forum discussions initiated, whereas EGR36 students seem to prioritize content access and forum engagement over access to assessment tasks. These visualizations can help course administrators determine the type of strategic interventions that each class/course would need to ensure that students' activities are kept in line with the intended pedagogical outcomes.
Figure 5 (A-C). Graphs depicting student activity in different courses

Performance Prediction
For performance prediction, the enriched data were first subjected to another round of pre-processing in order to identify and remove collinear attributes. First, attribute columns with all zero values were deleted from the data matrix; these include the "assign view submit assignment form", "quiz review", and "user view" attributes. Correlation analysis was then applied to every pair of attributes using Microsoft Excel. An absolute correlation value of 0.25 was used as the threshold: attributes whose correlation with the student's Final Grade (FG attribute) fell below it were treated as having no significant correlation. These attributes, which were subsequently removed from the dataset, include the "course recent", "forum add discussion", "forum view discussion", "forum view forum", "URL view", "user view all", "AM+", and "TDS/TAD" attributes (marked in yellow in Figure 6).
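The paper performed this filtering in Microsoft Excel; the sketch below is a Python analogue under the assumption that the correlation in question is the Pearson coefficient. The attribute values and grades are made up, and filter_attributes is a hypothetical helper name.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)

def filter_attributes(columns, target, threshold=0.25):
    """Drop all-zero columns, then columns with |r| below the threshold."""
    kept = {}
    for name, values in columns.items():
        if all(v == 0 for v in values):
            continue  # attribute never occurred in the log
        if abs(pearson(values, target)) >= threshold:
            kept[name] = values
    return kept

# Made-up data for five students.
final_grade = [90, 85, 70, 60, 50]
columns = {
    "quiz attempt":  [9, 8, 6, 5, 3],  # rises with the grade
    "quiz review":   [0, 0, 0, 0, 0],  # all zeros
    "user view all": [2, 5, 1, 5, 2],  # only weakly related to the grade
}
kept = filter_attributes(columns, final_grade)
print(list(kept))
```

With these values only "quiz attempt" survives: "quiz review" is removed as an all-zero column and "user view all" falls below the 0.25 threshold.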
Regression analysis was then applied to the resulting dataset to further process the attributes and identify strong predictors of the student's performance. This was done using Microsoft Excel; the results of the analysis are shown in Figure 7. A value of p < 0.05 was used to test for a significant relationship with the FG attribute. Out of the 16 remaining attributes, 9 were eliminated based on the regression results. These attributes include the following: "quiz attempt", "quiz close attempt", "quiz view", "resource view", "total", "AM-", "PM+", "PM-", and "TAD". These attributes are highlighted in yellow in Figure 7.
Finally, using Python version 3.6.2, several classification algorithms available in the Scikit-learn machine learning library were applied to the enriched data representation to determine how well these algorithms can predict student performance given the available amount of data. The remaining attributes used as predictors in this experiment, with their corresponding p-values, are shown in Figure 8. Table 5 provides the results of the prediction experiments for each group of students. Default parameter settings for the algorithms were used for these experiments, along with k-fold cross-validation with k=10. As shown, the best overall accuracy was obtained using the k-nearest-neighbor algorithm, with an average accuracy of 72.8%. The best accuracy for a single course, however, was 87.7%, obtained by the LDA algorithm on the EGR36 (Engineering Management) course dataset. These initial results indicate that the classification algorithms can modestly predict student performance. This modest performance may be due to the fact that other factors beyond study skills can affect the performance of students (Robbins et al., 2004). This is especially true in blended courses, where students are immersed in two different learning environments. However, if investigated further, this sort of information could provide a basis for identifying at-risk students and enabling instructors to provide effective formative feedback as early as possible.
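The evaluation protocol described above (default parameters, 10-fold cross-validation) can be sketched with Scikit-learn's cross_val_score. The data here are synthetic stand-ins, so the printed accuracies bear no relation to the figures reported in Table 5; only two of the algorithms are shown to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the enriched data: 7 predictors, 3 class labels.
X, y = make_classification(n_samples=300, n_features=7, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=0)

models = {"kNN": KNeighborsClassifier(), "LDA": LinearDiscriminantAnalysis()}

fold_scores = {}
for name, model in models.items():
    # cv=10 gives ten train/test splits; each score is one fold's accuracy.
    fold_scores[name] = cross_val_score(model, X, y, cv=10, scoring="accuracy")

for name, scores in fold_scores.items():
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```

Averaging the ten fold scores gives a single accuracy estimate per algorithm, which is the kind of figure that can then be compared across algorithms and courses.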

DISCUSSION AND FUTURE WORK
This paper proposes a novel approach for processing course log data obtained from Moodle-based blended courses in order to visualize patterns of student activity and to determine whether these log data can also be used to predict and anticipate the academic performance of students. Logs of student activities were summarized and processed using the Vector Space Model approach. This resulted in a novel vector-based form of representation called an Activity Vector, which can be used to map and understand the positioning of each student in a latent Activity Space. Results clearly indicate that the activity space, coupled with a one-hot vector representation for each unique activity dimension, can be used to visualize the differences in level and type of activity preferences of students, both individually and per class. In the long run, these types of visualizations could be used to monitor which student and/or class requires immediate and specific pedagogical adjustments.
Much work, however, needs to be done in terms of refining the process applied to the data. In particular, the log data should be time-sliced and processed on a per-period basis in order to determine whether and how the student's level and type of activity change over time. An extended approach for further enriching the VSM-based activity vector was also proposed by processing the datetime and IP address metadata of the log data. This enriched vector representation can be used as input to any classification/predictive model. An eventual application of this model is the immediate identification of at-risk students based on the actions they are exhibiting in the online environment.
Experiments testing the enriched representation on several machine learning algorithms using Python and the Scikit-learn library were also performed. The results indicate that classification algorithms can modestly predict a student's academic performance and, in particular, model the difference between high, low, and failed performances. This modest result indicates that there are more factors that need to be considered in predicting the performance of students in blended courses. More powerful machine learning classification techniques can be tested on the enriched vector representation to further isolate these factors and to determine whether the classification accuracy can still be improved.

Figure 3. A 3-dimensional Activity Space

Figure 4 provides a preview of the complete and final set of data attributes used to train the classification model.

Figure 6. Results of correlation analysis on the data matrix

Figure 7. Results of regression analysis on the resulting data matrix

Table 1. Dimensions of Action Logs

Table 2. Action types and class activity

Table 4. Time metadata grouping

Table 5. Classification Accuracy Ratings