STEM or not?
Learning Analytics and Educational Data Mining
Skills and tools: R, SPSS, Weka
Using Educational Log Data to Predict Whether Students Will Enter the STEM Track
I. Background and Introduction
The datasets I work with all come from the ASSISTments Longitudinal Data Mining Competition. As declining interest and falling enrollment in STEM become more serious problems, some schools are using detectors to give early warnings of dropout or loss of interest in STEM.
Using this dataset, we can predict whether a specific student will enter the STEM track in the future. Such a prediction is useful because it makes timely educational intervention possible and gives schools a chance to reignite the passion of students with an inclination toward STEM.
Unlike the datasets I have worked with before, this dataset has some features that seemed interesting and challenging to me. First, the datasets I worked with before were relatively small and contained exactly the data I wanted for my research purpose. This dataset, however, has a large number of variables after I merged the log data with the name labels: 270,293 objects of 81 variables in total. Lacking experience with such a large dataset, I was overwhelmed and unsure about which analyses to conduct and with which tools. Second, the data for this project were collected by the ASSISTments system rather than by myself, so the research questions did not come entirely from my own interests. Instead, I generated some questions after preliminary analysis and along the way as I tried to solve the earlier ones.
II. Research Questions
After preliminary exploration and some EDA on the dataset, I generated these questions at the beginning:
1. How to reduce the dimensionality of the dataset?
2. Do the logistic regression models built within different subsets show differences in their principal components?
3. Will building models within different subsets increase the accuracy of the prediction?
After I conducted the logistic regression and found the principal components with high weights, I needed to interpret the deeper meaning of those components, so I generated a fourth question:
4. How to interpret the meaning of the principal components?
III. Methods and Analysis
STEP 1. Preliminary exploration
The first thing I did was to combine all the students' log data into one dataset and then join the log data with the "training_label" dataset. The resulting dataset is complete but very large: 270,293 objects of 81 variables. Since the IDs are not predictive of the final result, I removed them, along with other redundant variables such as "start time", "end time", and "time duration". After this first round of removal, 69 variables remained.
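The merge-and-prune step can be sketched as follows. This is an illustrative Python/pandas sketch (the actual work was done in R), using toy stand-in frames and hypothetical column names such as `ITEST_id` and `hint_count`:

```python
import pandas as pd

# Toy stand-ins for the real per-class log files (columns are invented).
log_8 = pd.DataFrame({"ITEST_id": [1, 2], "hint_count": [3, 0],
                      "startTime": [100, 200]})
log_9 = pd.DataFrame({"ITEST_id": [3], "hint_count": [5],
                      "startTime": [300]})

# 1. Stack all per-student log files into one frame.
logs = pd.concat([log_8, log_9], ignore_index=True)

# 2. Join the logs with the outcome labels on the student identifier.
labels = pd.DataFrame({"ITEST_id": [1, 2, 3], "isSTEM": [1, 0, 1]})
data = logs.merge(labels, on="ITEST_id", how="inner")

# 3. Drop identifiers and redundant timing columns -- not predictive.
data = data.drop(columns=["ITEST_id", "startTime"])
```

With the real files, the same pattern applies: read each `student_log_*` file, concatenate, merge with the label table, and drop the non-predictive columns.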
To get a sense of the data, I ran a logistic regression in Weka on one subset, "student_log_8", exported the result, and arranged the attributes in descending order of weight. The attributes with the highest weights are listed below.
As you can see, two types of attributes appear at the top of the weight ranking: general indices describing a student's learning, and the skills. To be honest, I was very surprised to see so many skills carrying such large weights in the model. After we talked, Prof. Stamper suggested building models within different subsets based on the different skills. However, I found that there are 93 skills in total in the dataset. Splitting the whole dataset into 93 subsets to build models and make predictions would not be realistic: on the one hand, the subsets would be too small to support a model; on the other hand, it would generate too much work and might still fail to produce a predictive model in the end.
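The fit-then-rank-by-weight step, done above in Weka, can be sketched in Python with scikit-learn. The data and feature names below are made up purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Toy feature matrix standing in for one log subset; only feature_a drives y.
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Rank attributes by absolute coefficient, mirroring the sorted Weka output.
names = ["feature_a", "feature_b", "feature_c"]
ranked = sorted(zip(names, model.coef_[0]),
                key=lambda t: abs(t[1]), reverse=True)
```

Sorting by the absolute value of each coefficient (rather than the raw value) is what surfaces the most influential attributes regardless of sign.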
STEP 2. Classify the skills into different types
To address the problem of having so many skills, I decided to use hierarchical clustering to classify them into different types. Since the data are not large when only the skills are considered, I ran the hierarchical analysis in SPSS. The result for the skills is shown in the following chart. I decided to categorize the skills according to the clusters at the second layer of the dendrogram. This subset size seems appropriate: the smallest subset still has enough data to build a model and make predictions, and the largest is not so large that it consumes too much time.
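As an illustration of this clustering step (which I ran in SPSS), here is a Python sketch using SciPy's hierarchical clustering on made-up per-skill summary statistics, cut to four clusters to mirror the second-layer split:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Toy per-skill features (e.g. mean correctness, mean time); values invented.
skills = np.vstack([rng.normal(0, 0.3, size=(5, 2)),
                    rng.normal(3, 0.3, size=(5, 2))])

# Ward linkage builds the dendrogram; cutting it yields the skill types.
Z = linkage(skills, method="ward")
skill_type = fcluster(Z, t=4, criterion="maxclust")
```

Each skill then carries a cluster label, and the full log data can be split into one subset per label.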
STEP 3. Principal component analysis
After splitting the whole dataset into four subsets based on skill type, I ran principal component analysis on each of the four subsets and on the whole set. Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The principal components of the datasets are listed below (only part of the results is shown due to space limitations).
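A minimal PCA sketch in Python with scikit-learn, using toy correlated columns; in the real analysis the input was the ~69 retained log variables:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Five highly correlated toy columns: noisy copies of one latent variable.
base = rng.normal(size=(500, 1))
X = np.hstack([base + 0.05 * rng.normal(size=(500, 1)) for _ in range(5)])

pca = PCA(n_components=3)
scores = pca.fit_transform(X)  # per-student principal-component scores
# pca.components_ holds the loadings used later to interpret each component.
```

Because the toy columns are near-duplicates, the first component absorbs almost all of the variance; with real, less redundant log data the variance spreads over more components.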
IV. Results and Insights:
Even though I took the effort to split the dataset and conduct the principal component analysis, the results did not improve, and in some cases got worse.
According to the results, the accuracy of the model for the whole set is slightly higher than that of the four models working together, which means that splitting the dataset by skill type is not a successful way to improve prediction accuracy. In other words, the skill types do not influence the final result of the prediction. I tried to think of possible explanations and came up with two, which I will test in the future.
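One way to make the whole-set vs. split comparison concrete is to pool the per-subset results weighted by subset size. The subset sizes below are the ones reported later in this section; the accuracies are hypothetical placeholders:

```python
# Subset sizes from the skill-type split; accuracies are made-up examples.
subset_sizes = [164795, 22004, 2211, 53473]
subset_acc = [0.70, 0.76, 0.78, 0.69]

# Pooled accuracy of the four models working together: each subset's
# accuracy contributes in proportion to how many objects it covers.
pooled_acc = (sum(n * a for n, a in zip(subset_sizes, subset_acc))
              / sum(subset_sizes))
# Compare pooled_acc against the single whole-set model's accuracy.
```

Because the first subset is by far the largest, its accuracy dominates the pooled figure, which is why a couple of well-performing small subsets cannot lift the overall result on their own.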

Looking at the final results more closely, not every subset performs worse: subsets 2 and 3 actually do better than the general model. Inspecting the data, the second and third subsets share one similarity: they are much smaller than the other two. They contain 22,004 and 2,211 objects respectively, far fewer than the first (164,795) and the fourth (53,473). This might be the reason for their better performance, so I will try splitting the skill set into even smaller subsets to see whether size influences the result.

Another possible reason is that the sample used for preliminary exploration may have been biased, since I only used "student_log_8". Because I did not expect differences among the datasets, I simply selected the one with the smallest size so that I could run it quickly in Weka and get a sense of the data. The idea of starting with a quick exploratory run was fine, but the way I chose the sample might not have been proper. So I will instead draw a random sample across the whole population to make the selection more representative.
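The less biased sampling plan could look like this sketch: a uniform random sample drawn across the merged data from all log files, rather than taking one file whole. The column names are hypothetical:

```python
import pandas as pd

# Toy merged dataset tagged with the log file each row came from.
data = pd.DataFrame({"source_log": [8] * 50 + [9] * 50,
                     "x": range(100)})

# A 20% simple random sample across all logs avoids single-file bias;
# the fixed random_state keeps the exploratory sample reproducible.
sample = data.sample(frac=0.2, random_state=0)
```

Sampling after the merge means every log file contributes in proportion to its size, so the exploratory subset reflects the whole population.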
V. Next Steps
Besides the follow-ups mentioned in the last part, I will explore some other topics that were not covered in this project.
First, I will think about how to interpret the principal components after conducting the analysis. For example, principal components 1, 2, 3, 5, 6, 10, 12, and 16 have large weights in the logistic regression model for that subset. Take the first three as examples.
PC1: The help requesting behaviors and the time spent.
PC2: Overall learning results about correct rate, careless and gaming the system.
PC3: Affections.
So in the future, I will look into the principal components one by one and interpret their meanings. What's more, I will compare the principal components across the different datasets to see whether there are similarities and differences attributable to the different skill types. It would be very interesting to find common features, because then the learning system could deliberately collect those kinds of data. I will also examine the differences among the subsets and see whether they can be traced back to the different skill types.
What's more, I am going to learn more about algorithms used for prediction problems and read more papers to get a sense of the possible reasons behind students' decisions about whether or not to go into the STEM track.
VI. Reflection
At the beginning, to handle the large scale of the dataset, I had to switch from SPSS to R. During the process, I learned more about principal component analysis, which is a useful way to reduce the dimensionality of a dataset. Through this project, I not only gained a better sense of learning analytics and educational data mining but also became familiar with a tool that was totally new to me. I think I learned a lot.