Student Score Prediction Machine Learning

EDA | Correlation | PCA | Feature Engineering | XGBoost

Preamble

This project was my first externally assessed piece of data analytics work. I had 4 days to explore a dataset and build an end-to-end machine learning model (one that receives user inputs and makes a prediction). The dataset consists of approximately 15000 rows and 20 parameters (from bag color to travel mode to attendance rate) on Secondary 4 (15-16 year old) students. The feedback from the assessors was that overall, it was very well done and they were particularly impressed by the PCA breakdown (although for this instance it did not generate new features).

I spent 1.5 days on this section of the work and another 2.5 days on creating and deploying the machine learning model (this was my first time working on deployment - bash scripts and YAML files were foreign concepts, so I decided to allocate more time to the deployment of the model). The entire Jupyter Notebook is available below. To check out the deployed machine learning model, look out for Part 2 of this project!

Introduction

The problem given is to create a classification model and regression model based on final test results so that schools can intervene and support students in need before their actual O-levels.

Through this EDA, the following should be achieved:

  1. Successful and complete extraction of data
  2. Removal of erroneous entries
  3. Understanding of every feature and the relationships between features and the target (final test scores)
  4. Encoding of features so that they may be used by the candidate model(s)
  5. Engineering of features based on relationships between features and the target
  6. Filling of empty data as effectively as possible to maximize data utility
  7. Removal of features that do not benefit the model(s)

Data Extraction

sqlite3 and pandas are used to quickly convert the database file into an easily manipulable dataframe in the notebook.
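A minimal sketch of this step, assuming a hypothetical database filename (score.db) and table name (score); the actual names used in the project may differ:

```python
import sqlite3
import pandas as pd

# Connect to the assumed database file and list the tables it contains.
conn = sqlite3.connect("score.db")
tables = pd.read_sql("SELECT name FROM sqlite_master WHERE type='table';", conn)
print(tables)

# Load the assumed 'score' table into a dataframe for exploration.
df = pd.read_sql("SELECT * FROM score;", conn)
conn.close()

print(df.shape)
df.head()
```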

Exploring Feature Profiles and Relationships

Different methods are used to determine the characteristics of each feature and their relationships with one another.

Summary of Dataset and Feature Profiles with pandas_profiling
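The report itself only takes a couple of lines to produce; a sketch assuming the dataframe from the extraction step is named df:

```python
from pandas_profiling import ProfileReport  # newer releases ship as ydata_profiling

# Summarize every feature: types, distributions, missing values, and
# warnings such as high cardinality or high correlation.
profile = ProfileReport(df, title="Student Score Dataset Profile")
profile.to_notebook_iframe()   # render inline in the Jupyter Notebook
# profile.to_file("student_score_profile.html")  # or export to HTML
```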

Analyzing Pandas Profiling Report Warnings

student_id has a high cardinality: 15000 distinct values (high cardinality). High cardinality is expected here, as each student should have a unique ID. However, since the dataset has 15900 rows, some students appear more than once, which suggests that duplicate entries exist.

n_male and n_female are highly correlated with each other (high correlation). This is expected, as the numbers of male and female students in each class should scale with class size in mixed schools. Note: cases where n_male == 0 or n_female == 0 might be an identifier for single-sex schools and a possible additional feature.

hours_per_week is highly correlated with final_test (high correlation). hours_per_week is a numerical representation of study effort, so this correlation is expected and likely to be positive.

wake_time is highly correlated with mode_of_transport and sleep_time (high correlation). The correlation between wake_time and mode_of_transport suggests that some modes of transport are associated with earlier or later waking times; further analysis is needed to determine the actual relationship. The correlation between wake_time and sleep_time is expected, as students who need to wake up earlier are likely to need to sleep earlier. Note: mode_of_transport likely requires one-hot encoding, since its categories are not strictly ordinal (in terms of time taken to get to school, no mode is strictly faster than the others, e.g. a walk to school may take little time even though walking is slower than private transport).

number_of_siblings is highly correlated with final_test (high correlation). This negative correlation is interesting and explainable (family resource distribution, distraction levels), and it indicates that this is an important feature for predicting scores.

n_female is highly correlated with gender (high correlation). This correlation may simply reflect that a student's gender implies the presence of students of that gender in their class.

direct_admission is highly correlated with final_test (high correlation). This correlation is interesting and needs further analysis to determine whether direct admission is positively or negatively related to final test scores. Note: the effect of direct_admission may be highly dependent on the CCA the direct admission is for; it may be useful to classify the CCAs into sports and non-sports categories later on (if that has not been done) and use this feature in combination with direct_admission.

final_test is highly correlated with hours_per_week and 3 other fields (high correlation). The 4 fields are attendance rate, number of siblings, hours of study per week and direct admission state. Direct admission state will require additional analysis to determine its actual relationship with test scores.

final_test has 495 (3.1%) missing values (missing). Since final_test is the target, there is no choice but to let this data go; any form of imputation would corrupt the dataset with a bias towards the imputation method.

attendance_rate has 778 (4.9%) missing values (missing). Imputation may be worthwhile, as 4.9% is a substantial share of the data.

n_male has 360 (2.3%) zeros and n_female has 997 (6.3%) zeros (zeros). This confirms that there are students from single-sex schools. Note 1: In Singapore, there are quite a few single-sex schools that perform relatively well academically, so a positive correlation is expected between school type (reflected by n_male and n_female) and test scores. Interestingly, however, the dataset states that the data is from a single school, which would mean the school has some single-gender classes and some mixed-gender classes. This may not really be the case, but regardless, it may be useful to add a single_sex_class binary feature to capture this possibility. Note 2: Both n_male and n_female are slightly negatively correlated with the final test score, suggesting that class size may be affecting the final test score (larger class, lower score, regardless of gender). Adding a feature that explicitly sums n_male and n_female to give the class size is likely to help the models capture this relationship explicitly.

tuition: a boolean feature that will need analysis to determine its relationship with test scores, although the Phik (φk) correlation heat map already shows a positive correlation between tuition and test scores.

bag_color: almost certainly (99.99%) a feature to be removed.

learning_style: a categorical feature (audio/visual) that will need analysis to determine its relationship with test scores and whether it should be removed. Note: this classification may not be very useful, as studies have shown that no such differentiation exists; however, it may have resulted in different treatment of the students or some behavioral/psychological effect on students who believe they are audio/visual learners: https://journals.sagepub.com/doi/full/10.1111/j.1539-6053.2009.01038.x

Focused Feature Analysis and Feature Engineering

Bag color and Student ID (bag_color,student_id)

Student ID cannot possibly affect the score in a useful way, but can bag color really affect test scores?

Let us deal with the possible duplicate entries in student_id first.
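A sketch of one way to handle them, keeping the first occurrence of each student_id (the exact tie-breaking rule used in the project is an assumption):

```python
# Count the rows involved in duplicated student_ids.
dup_mask = df.duplicated(subset="student_id", keep=False)
print(f"Rows with duplicated student_ids: {dup_mask.sum()}")

# Keep the first occurrence of each student_id and drop the rest.
df = df.drop_duplicates(subset="student_id", keep="first").reset_index(drop=True)
print(f"Remaining rows: {len(df)}, unique student_ids: {df['student_id'].nunique()}")
```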

After going through that exercise, there are no more duplicate student_ids. But there are still entries with missing attendance rate and final test scores. Whether it is better to impute these entries in a certain way, leave them, or remove them entirely is best figured out by validating the model before and after the adjustments.

The swarm plot visually indicates no difference in score distributions between bag colors (note: the swarm colors do not correspond to the actual bag colors).

The statistics agree with logic and confirm negligible differences in test score distributions between bag colors. This feature gives no information on test scores and will very likely be removed before model building.

Enrolled CCA (CCA)

Visually, it is clear that having no CCA results in higher test scores.

Statistically the difference is huge: the mean score of students with no CCA is at least 10 marks higher than that of students with any CCA. Note that there seems to be minimal difference in performance based on which CCA a student belongs to. It looks like it may be possible to convert CCA to a boolean to reduce model complexity.

Direct Admission State and Enrolled CCA (direct_admission, CCA)

This sub-section will focus on Direct Admission State and CCA interactions. Note that these two features were isolated due to the domain knowledge that Direct Admission State is closely linked to CCA as most students undergo direct admission through a specific skill which they will develop in their CCA. Alternatively, direct admission students can be students who are participants in academic competitions unrelated to CCAs (e.g. Math/Science Olympiad winners or Language/Humanities top scorers: https://www.moe.gov.sg/secondary/dsa)

The first observation is that there are clearly many more non-direct admission students than direct admission students.

The second observation is that the distinct swarms at different score levels and the significantly larger variance of the direct admission students indicate that there might actually be two or more score distributions within the direct admission students.

The mean score of the direct admission students is significantly higher (by 7.3 points) than that of the other students. The difference between the medians is even larger, at 12 points, indicating that a disproportionate number of students in the direct admission pool have scores at the lower end of the spectrum. The negative skew of the direct admission score distribution is captured by the fact that the mean is lower than the median.

The difference between the direct admission and non-direct admission score distributions is made clear by the overlain KDE plots. The negative skew and split in distributions are visibly caused by a second peak at around 45-49 points.

Plotting the distributions of the direct admission students from different CCAs shows that the direct admission students with no CCA have a distinct distribution from the direct admission students with CCAs. When the CCA is one-hot encoded, this distinction will be captured. It might also be worthwhile to create an additional label for direct admission students that distinguishes those who are in CCAs from those who are not, since the distributions of the two groups are so different.

To confirm that the direct admission students with no CCA are a different group from the non-direct admission students with no CCA, a KDE plot is made to characterize the two groups. The difference between the two distributions is clear, and the additional feature identifying this type of direct admission student will be added.
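A sketch of how that flag could be built; the column encodings ('Yes' for direct_admission and 'None' for students without a CCA) are assumptions:

```python
# Flag direct admission students who are not enrolled in any CCA.
df["da_no_cca"] = (
    (df["direct_admission"] == "Yes") & (df["CCA"] == "None")
).astype(int)

print(df["da_no_cca"].value_counts())
```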

Waking Time and Sleeping Time (wake_time, sleep_time)

This sub-section will analyze the time-related features. A quick note: due to the cyclical nature of time, it should be converted through a cyclical function before model training if the times are going to be compared to one another (e.g. 2300H can otherwise be seen as distant from 0000H even though they are only 1h apart).

Waking time does not seem to be strongly correlated with test scores. Interestingly, the number of students and the score distribution at each waking time are similar across the different waking times.

Sleeping time distributions show that most students sleep between 21:00 and 0:00. Because fewer students sleep at the later times, it is not visibly apparent how the scores are distributed for each sleeping time, although it is clear that those who sleep after 1:00 fall within the <50 score range.

A parameter that may be a better indicator of test score is the number of hours slept, which is (wake_time - sleep_time) after both columns are converted from object to time/datetime types.
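A minimal sketch of that conversion, assuming both columns are stored as 'HH:MM' strings; the subtraction is wrapped across midnight since students sleep before midnight and wake after it:

```python
# Parse the assumed 'HH:MM' strings into datetimes (the date part is a dummy).
wake = pd.to_datetime(df["wake_time"], format="%H:%M")
sleep = pd.to_datetime(df["sleep_time"], format="%H:%M")

# A negative raw difference means the sleep time is on the previous day,
# so add 24 hours to wrap across midnight.
hours = (wake - sleep).dt.total_seconds() / 3600
df["sleep_hours"] = hours.where(hours > 0, hours + 24)

print(df["sleep_hours"].describe())
```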

Notice that sleep hours shows a much clearer distinction between test score distributions, with the large majority of students who sleep fewer than 7 hours performing strictly within the <55 score range.

The clear difference in score distributions means that sleep hours is a good feature for predicting test scores. Students who sleep 6 hours or less are very likely to score around 43-45 with a standard deviation of around 3.6.

Students who sleep 7 hours or more are likely to score much higher; however, the high variance in these sub-groups indicates that there are other factors affecting their scores aside from sleep hours.

Attendance Rate, Study Hours and Age (attendance_rate,hours_per_week,age)

The first batch of features, consisting of continuous data, will be analyzed in this sub-section.

The regression plot with an order-2 polynomial best-fit line shows that there is an optimum number of hours to study per week (around 10h). It also shows that there are students who supposedly do not study much but perform relatively well compared to students who study the same amount. These students may be anomalies.

Box plots are a good way to identify anomalies visually. By looking for data points that are visibly distant from Q1 and Q3 of the data, anomalies are quickly spotted. As initially suspected, the students who study for less than 3 hours but score above 75 are anomalies. Removing these anomalies may benefit the model training process.

Only 24 entries in the entire dataset fall into this category; removing them from the dataset will likely help the study hours per week feature predict scores more accurately.

Attendance rate shows a clear positive correlation with the test scores with no anomalous activity.

Some entries for age seem to be erroneous, since a negative age is not possible; they must be removed before model training. There also seem to be some ages that were mislabeled. Since the data is based on O-Level students, the ages should be 15/16, so it will be assumed that 5 and 6 correspond to 15 and 16. In reality, it is best to clarify with the data owner whether this is the case.
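A sketch of that clean-up under the stated assumption:

```python
# Drop impossible negative ages, then remap the presumed typos 5 -> 15 and 6 -> 16.
df = df[df["age"] > 0].reset_index(drop=True)
df["age"] = df["age"].replace({5: 15, 6: 16})

print(df["age"].value_counts())
```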

No significant difference in score distributions between students aged 15-16. This is expected because both age groups are in the same education system and any advantage from being born a few months earlier becomes insignificant over 15+ years. Age data might be noise in this context and should be considered for removal (to be confirmed during model evaluations).

Tuition, Gender and Learning Style (tuition,gender,learning_style)

Tuition has a clear positive impact on the score distribution. It may be interesting to look at the relationship between study hours per week and tuition status in case there is a hidden relationship between the two features (e.g. students with tuition may not count tuition hours as study hours, resulting in students with low study hours, 'Yes' for tuition and good scores).

It seems that a substantial number of the <=4h study time students who score well have tuition, indicating that some of the previously identified anomalies could have counted tuition hours outside of their reported study hours, or reported tuition hours as their study hours while excluding any other study time.

The proportion of students with tuition among the anomalies is similar to the proportion in the overall dataset, so tuition does not explain their scores. It seems the students with low study hours and high scores are confirmed to be anomalies and can be removed.
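A sketch of the removal, using the thresholds identified earlier (fewer than 3 study hours per week and a score above 75):

```python
# Anomalous combination identified above: very low study hours but high scores.
anomaly_mask = (df["hours_per_week"] < 3) & (df["final_test"] > 75)
print(f"Anomalous entries: {anomaly_mask.sum()}")

df = df[~anomaly_mask].reset_index(drop=True)
```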

Gender alone does not seem to affect the score distribution significantly, but it is related to the possibility of belonging to an all-boys or all-girls class.

Learning style clearly affects the test scores, with visual learners performing significantly better than auditory learners (a higher mean by 8 points and a higher median by 9 points).

Mode of Transport (mode_of_transport)

It is not immediately apparent how this categorical feature affects test scores.
The categories are ordinal in the sense that they have comparable speeds, but that alone should have no effect on a student's score.
This suggests that it may not be the mode of transport that affects the score, but the implications of using a certain mode of transport.
For example, having private transportation can imply that the student's family can afford a car and hence possibly other resources.
Relating the mode of transport to wake time could also be an indicator of affluence and access to time efficiency (e.g. early wake time and walking implies possible lack of resources, while late wake time and driving could mean an abundance of resources).

Based on the statistics and distributions, there is no significant difference between the performance of students using the different modes of transport.
Let us try to determine if the mode of transport even affects sleep time.

The mode of transport does not seem to affect the number of hours a person sleeps either.

Travel mode seems to have no effect on sleep time except for a slightly higher density at the 8h sleep mark.

Travel mode also does not seem to affect the time spent studying.

There also does not seem to be any particular relationship between sleep hours, test scores and mode of transport (when considered simultaneously).

Mode of transport is strongly correlated with wake time. Clearly, students who walk get to wake up the latest, while those who take public transport need to wake up the earliest.
But as established earlier, neither wake time nor sleep time alone is a good indicator of test performance, which explains the apparent absence of any effect of mode of transport on scores.

Statistically, the difference in wake times is clear, with approximately one-hour increments in wake time between those who walk, those who take private transportation and those who take public transportation. As discussed earlier, there is no expectation for wake time to be strongly correlated with the score; hence mode of transport, which is strongly correlated with wake time, also has no strong correlation with test scores.

Female and Male Classmates (n_female,n_male)

This is a possible indicator of a hidden boolean feature: single-sex class versus non-single-sex class.
It is also an indicator of class size, which is another possible additional feature.
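A sketch of both derived features (the column names mirror those referenced later in the analysis):

```python
# Class size as the explicit sum of male and female classmates.
df["class_size"] = df["n_male"] + df["n_female"]

# Single-sex class indicators: a zero count of one gender implies the class
# consists entirely of the other gender.
df["male_class"] = (df["n_female"] == 0).astype(int)
df["female_class"] = (df["n_male"] == 0).astype(int)

print(df[["class_size", "male_class", "female_class"]].describe())
```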

As anticipated, there seems to be a negative correlation between number of females in the class and the score distribution.

Similarly a negative correlation between number of males in the class and the score distribution can be seen.

Using the joint KDE plots, it is clear that the classes with fewer students have a much more favorable score distribution. When splitting the n_female feature into different class sizes, the categories with fewer students (n_female_cat 1) show a more dominant positive skew towards the higher scores as compared to the categories with more students (n_female_cat 2 and 3)

A similar observation can be made for the male students and their different class sizes. However there are two key differences:
(i) While classes with few females make up the majority in the n_female feature, for the n_male feature it is the mid-sized classes that make up the bulk.
(ii) The n_male_cat 2 classes have a slightly more positive skew as compared to the n_male_cat 1 classes, unlike what was seen in the n_female_cat analysis.
This distinction suggests that it is useful to keep the male and female class size features distinct.

There seems to be additional complexity in the class_size distribution with clusters forming at different sections of the grid. This implies there is an additional dimension to the data that is causing clustering of the class size data.

For male single-sex classes, the effect of the single-sex class is distinct enough that it shows up on the grid, implying that the male single-sex feature may give additional information about test performance on top of class size and gender distribution.

Single-sex female classes do not seem to have a distinct performance.

This is confirmed by the joint plot in scatter form, which shows single-sex female classes performing at different levels across the different class sizes. One point to note is that there is a clear positive trend as class sizes get smaller for the single-sex female classes; small classes are in fact a distinct feature of some of the better-performing single-sex schools.

Overall, the number of students from single-sex classes is not substantial and the trends are not clearly apparent, so the effect of the single-sex features will need to be determined during model validation.

Class gender ratio could also be a factor affecting performance, although this is unlikely.

Based on the plot, the gender_ratio feature is not going to be useful, as the test scores are spread roughly uniformly across different gender ratios.

Number of Siblings (number_of_siblings)

This was previously noted to have a negative correlation with test scores based on the profiling report.

The regression line confirms that there is a negative correlation between test scores and number of siblings.

The overlain density plots for each sibling category confirm that the distributions are in fact distinct and that this will be a good feature for predicting test scores. Of interest are the distinct triple peak for students with 2 siblings and the double peak for students with no siblings. This could be due to a feature related to resources (as resource distribution is affected when there are siblings in the family), most likely tuition.

Tuition does indeed seem to cause a rift in the test score distributions of students with 2 siblings, and the lack of tuition explains the peak at around 43 marks, indicating a ceiling on the performance of some of these students due to the lack of tuition. However, when comparing these distributions against the earlier plots of students with tuition versus those without, it is peculiar that the highest density for students with 2 siblings and no tuition is in the 73-mark region, whereas the peak for students with no tuition in general is at around 50 marks. This could mean that students with 2 siblings are in fact 'overcompensating' for their lack of tuition with additional effort, which is most likely captured by the hours studied per week.

As confirmed by the statistics above, there is a group of students with 2 siblings who study significantly more than their peers, which results in the mean being almost a full hour higher than the median.
It is likely that a large portion of this group belongs to the no-tuition group, explaining the unexpected spike at 73 for students with no tuition and 2 siblings.
The takeaway is that tuition status and number of siblings could be an indicator of students lacking in resources, and by using learning hours to distinguish this group of students, the model might be able to better predict that their performance is likely to be above average.

By reversing the previous logic, students with no siblings and with tuition are likely to be in a privileged position which allows them to perform exceptionally well. The plot above confirms that this theory holds, and the fact that the mean number of hours studied by students with 0 siblings is also almost a full hour higher than the median indicates that there is a group of 'overachievers' with no siblings who study an exceptional number of hours on top of their tuition. A 'privilege_rating' feature seems highly plausible at this point; it will be created first and tested with an actual model later to determine whether it helps with score prediction. This feature is ordinal, since privilege runs on a spectrum.
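One possible construction of this ordinal feature is sketched below; the exact formula is an assumption (with tuition assumed to be encoded as 'Yes'/'No'), chosen so that having tuition raises the rating and each additional sibling lowers it:

```python
# Higher values = more privileged: tuition adds to the rating, siblings subtract.
df["privilege"] = (df["tuition"] == "Yes").astype(int) - df["number_of_siblings"]

print(df.groupby("privilege")["final_test"].agg(["count", "mean", "median"]))
```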

Interestingly, even though the underprivileged and privileged groups each have a sub-group that spends more time studying and pulls up the average study hours of their respective categories, it is not these sub-groups who contribute to the high scores. Rather, it is the group that studies the statistically optimal 9h per week among the underprivileged, and the group that studies 5-10 hours per week among the privileged, that contribute to the high scores (based on the mean and median).

This concludes the focused feature analysis. In the next section, we will encode relevant features and begin making the difficult decisions for feature selection and imputation versus data removal before finally embarking on the model training.

Encoding, Feature Selection and Data Imputation or Removal

We will use both unsupervised and supervised feature selection.

Encoding

The columns are confirmed to be categorical and will be one-hot encoded.
Note that even though mode of transport seems ordinal (in terms of speed), the earlier analysis has shown that in relation to the test scores this ordinal relationship does not hold (faster/slower does not mean better/worse), so it will be treated as a nominal categorical feature.
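A sketch of the encoding step; the exact list of nominal columns is an assumption:

```python
# Assumed nominal columns; CCA could alternatively be reduced to a boolean
# as discussed earlier.
categorical_cols = ["CCA", "mode_of_transport", "learning_style", "gender",
                    "direct_admission", "tuition"]

df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
print(df.columns.tolist())
```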

This leaves us with the wake and sleep times, which are datetime objects.
For this specific instance, since we are dealing with time one day at a time, it is not necessary to think cyclically; we will remap the times onto a continuous linear scale instead. It is not flawless, but it works for the time range the data is most likely to fall in.
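A sketch of one such remapping, assuming 'HH:MM' strings and an (assumed) noon cut-off so that early-morning times are treated as belonging to the next day:

```python
def to_linear_hours(series, cutoff=12):
    """Map 'HH:MM' strings onto a single linear hour scale.

    Times before the cut-off are shifted by 24h, so 23:00 -> 23.0,
    01:00 -> 25.0 and 06:30 -> 30.5, keeping evening-to-morning ordering.
    """
    t = pd.to_datetime(series, format="%H:%M")
    hours = t.dt.hour + t.dt.minute / 60
    return hours.where(hours >= cutoff, hours + 24)

df["sleep_time"] = to_linear_hours(df["sleep_time"])
df["wake_time"] = to_linear_hours(df["wake_time"])
```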

The output is what we expect, and establishes the linear relationship between sleep time and wake time.

The data looks almost ready to go. This problem has moderate dimensionality, a small dataset, features on different scales (which can be scaled if needed), many zeros, and around 5% missing values in non-target features which may or may not benefit from imputation; in the context of this problem it also requires both a regression model and a classification model as output.
XGBoost is a good candidate for dealing with these conditions.

Imputation

Only attendance rate still has missing values, so this will be the only feature dealt with in this sub-section. Since XGBoost is the highest-priority candidate model, all validation of the adjustments' effects will be based on XGBoost. A critical point to note is that XGBoost was designed to handle NaN values: at each split it learns a default direction for rows whose value for that feature is missing. This means a prediction can still be made reasonably accurately even with missing values; however, because it was previously established that attendance_rate has a strong correlation with test scores, imputing values into this feature is likely to improve the model's ability to predict scores overall.

First, we establish the baseline for both models. Since we want both models to perform well, imputations made to the data should (ideally) make both models perform better.

But before that, the labels for the classification model need to be determined. Since education is about equalizing, but resources are limited and resource allocation is about optimization, the students should be banded based on scoring percentiles, with those scoring below a certain percentile considered as 'requiring support/attention'. This makes more sense than setting a raw score threshold, because resources should be allocated in proportion to neediness, yet the number of needy students a school can support is in reality limited. Hence, schools should focus on those performing worst in their school rather than on everyone performing below some specific score (which might diffuse attention and resources away from those who need them most).

Creating a Target for Classification

The idea is that the weaker the student, the lower the score; this gives a classification target with an adaptive threshold. This index can also be converted directly into the priority of help that should be given to each student.
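A sketch of the banding; the percentile cut-offs are illustrative assumptions rather than the exact thresholds used, and rows with missing final_test are assumed to have been dropped already:

```python
import numpy as np

# Band students into Final Grades 1-4 by score percentiles (assumed cut-offs).
# Grade 1 = weakest band, i.e. the highest priority for support.
cutoffs = df["final_test"].quantile([0.10, 0.30, 0.60]).tolist()
df["final_grade"] = pd.cut(
    df["final_test"],
    bins=[-np.inf, *cutoffs, np.inf],
    labels=[1, 2, 3, 4],
).astype(int)

print(df["final_grade"].value_counts().sort_index())
```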

Conduct Validation of Regression Model with Simply Imputed Data

Conduct Validation of Regression Model with Iteratively Imputed Data
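A condensed sketch covering both imputation experiments above (the imputer settings, CV folds and feature-matrix construction are assumptions; swapping in XGBClassifier and accuracy scoring gives the classification variant used in the next two sub-sections):

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

# Assumed construction of the feature matrix: numeric columns only,
# rows without a final_test score dropped.
data = df.dropna(subset=["final_test"])
y = data["final_test"]
X = (data.drop(columns=["final_test", "final_grade", "student_id"], errors="ignore")
         .select_dtypes(include=["number", "bool"]).astype(float))

for name, imputer in [("simple", SimpleImputer(strategy="mean")),
                      ("iterative", IterativeImputer(random_state=42))]:
    X_imp = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)
    scores = cross_val_score(XGBRegressor(random_state=42), X_imp, y,
                             scoring="neg_mean_squared_error", cv=5)
    print(f"{name} imputer, mean MSE: {-scores.mean():.2f}")
```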

There is a small difference between the performance of the simple imputer and the iterative imputer in this case: the simple imputer generates an average MSE of 32.49, while the iterative imputer generates an average MSE of 32.25. We will use the iteratively imputed data in this case.

Conduct Validation of Classification Model with Simply Imputed Data

Conduct Validation of Classification Model with Iteratively Imputed Data

We see that for classification, the iteratively imputed data actually does more poorly than the simply imputed data (a drop in accuracy from 0.7419 to 0.7399). This gives us an idea of how the pipeline configuration for each model should differ to optimize the data for that model's performance.
For this problem, and based on the above experiment: (i) simple imputation should be used when processing the data for the classification model, while (ii) iterative imputation should be used when processing the data for the regression model.

Feature Selection and Feature Engineering

This section removes features that are less obviously useless and creates features that can support prediction. There are many ways to do this, but for the scope of this EDA we will only use PCA to engineer new features and sequential feature selection to select features. Feature engineering will be attempted first in case useless features are engineered, in which case feature selection can remove them at a later stage.

First off, because PCA is a function of variance (a geometric attribute of the data), it is important to standardize the applicable columns (columns where the data are actual quantitative distributions, not some serial information) to bring them onto the same scale and prevent specific features from dominating simply due to their magnitude.

Feature Engineering with PCA

Explained variance is used to identify the key principal components, i.e. those along which the features vary most relative to one another. This can be useful for generating new features based on the relationships between features.
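A sketch of this step; quant_cols is an assumed list of the genuinely quantitative columns, and X_iter is the iteratively imputed feature matrix referred to later in the text:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumed subset of quantitative columns to standardize and decompose.
quant_cols = ["n_male", "n_female", "n_male_cat", "n_female_cat", "class_size",
              "attendance_rate", "hours_per_week", "number_of_siblings",
              "sleep_hours", "wake_time", "sleep_time", "privilege", "age"]

X_scaled = StandardScaler().fit_transform(X_iter[quant_cols])

pca = PCA()
pcs = pca.fit_transform(X_scaled)

# Share of variance explained by each component.
print(pca.explained_variance_ratio_.round(3))

# Loadings: how strongly each original feature contributes to each component.
loadings = pd.DataFrame(pca.components_.T, index=quant_cols,
                        columns=[f"PC{i + 1}" for i in range(len(quant_cols))])
print(loadings.round(2))
```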

Typically, the relationships between features in components with high explained variance can be tricky to decipher.
PC1: This component highlights the strong negative correlation between the numbers of students of opposite genders, and the high variance is compounded by the negative correlation between the class gender categories (i.e. a class with many males is likely to have few females).
PC2: The second component highlights the strong positive correlation between attendance rate and sleep hours. It also highlights that these features are negatively correlated with sleep time. This actually indicates that sleep hours may be a useful feature for score prediction, since we know it is quite a different feature from attendance rate, yet it supports the 'positive behavior' while being negatively correlated with a 'negative behavior'.
PC3: The third component simply highlights the negative correlation between the privilege feature that was included and the number of siblings. This is expected, since privilege was intended to be negatively correlated with the number of siblings.

Based on the mutual information, some PCs with high MI scores have surfaced as strong predictors, indicating that the combinations of features with high loadings in those PCs are useful for determining how a student will score. Conversely, PCs with low MI scores may indicate that the relationships between the features in those PCs are not useful for determining how well a student will score.
Note: the +/- signs indicate the direction of each feature's loading within the PC, which is critical to understanding what the PC means.
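A sketch of how the MI scores might be computed for the components (y_train is assumed to be the final_test target aligned with the rows of X_iter):

```python
from sklearn.feature_selection import mutual_info_regression

# Mutual information between each principal component and the test score.
mi_scores = mutual_info_regression(pcs, y_train, random_state=42)
mi = pd.Series(mi_scores, index=[f"PC{i + 1}" for i in range(pcs.shape[1])])
print(mi.sort_values(ascending=False).round(3))
```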

Features with High Loadings in High MI Score PCs
--PC8+ number_of_siblings
--PC8- privilege
--PC9+ n_female_cat
--PC9- n_male_cat
--PC11+ n_male
--PC11- n_female

Features with High Loadings in Low MI Score PCs
--PC13- n_male
--PC13- n_female
--PC13+ class_size
--PC12+ wake_time
--PC12- sleep_time
--PC10+ attendance_rate
--PC10- sleep_hours

PC8 highlights a condition that was noticed during feature analysis: privilege can be derived from some of the other features and is useful for predicting scores.
PC9 and PC11 would have been useful hints that single-sex classes or gender ratios might be features to take note of, but those have already been explored and implemented/removed.
Overall, it seems that the feature engineering in the earlier stage has been comprehensive enough to capture, and even create, the useful dynamics between features for score prediction; no additional features will be added.

Feature Selection

The sequential feature selector uses a greedy algorithm to choose the most useful features one by one, or to remove the least useful features one by one. Greedy algorithms produce local optima, so there may be different results depending on which direction is used; generally, running both and taking the intersection is a balanced approach to selecting the best features.
There are 30 features in X_iter at the moment. For the first cut, we will select the top 20 features using the greedy algorithm in both directions and remove the features that have been rejected by both.

Feature Selection for Regression Model
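A sketch of the two-direction selection for the regression model (selector settings are assumptions; the classification variant swaps in XGBClassifier and accuracy scoring):

```python
from sklearn.feature_selection import SequentialFeatureSelector
from xgboost import XGBRegressor

estimator = XGBRegressor(random_state=42)

# Greedy selection of 20 of the 30 features, run in both directions.
forward = SequentialFeatureSelector(estimator, n_features_to_select=20,
                                    direction="forward", cv=5,
                                    scoring="neg_mean_squared_error")
backward = SequentialFeatureSelector(estimator, n_features_to_select=20,
                                     direction="backward", cv=5,
                                     scoring="neg_mean_squared_error")
forward.fit(X_iter, y_train)
backward.fit(X_iter, y_train)

kept_fwd = set(X_iter.columns[forward.get_support()])
kept_bwd = set(X_iter.columns[backward.get_support()])

# Drop only the features rejected by both directions.
to_drop = set(X_iter.columns) - (kept_fwd | kept_bwd)
print(sorted(to_drop))
```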

Feature Selection for Classification Model

Compared against the K-fold validation score for the regression model conducted with the full X_iter, this score is actually worse, with the MSE increasing from 32.25 to 32.38, meaning that we may have removed too many features.
However, for the classification model, the removal of these features actually improved the accuracy from 0.7399 to 0.7468.

Adding back the sleep hours improved the error from 32.38 back to 32.25. This may be the sweet spot, considering that removing more features results in a worse score, while adding back female_class only brings the score back to the original.

For the classification model, adding back the sleep hours seems to have made the classification slightly worse (a drop of 0.02% in accuracy), suggesting either that there is room to reduce complexity to benefit the classification model's performance, or that a more thorough validation is required to more accurately determine the impact of removing sleep hours.

Further experimentation shows that it is better to remove age and wake time as opposed to sleep hours. This makes sense as earlier analysis has shown that sleep hours has some correlation to test scores, while age and wake time did not show such correlations. Realizing this highlights the importance of understanding the features, and also the potential weakness of sequential feature selection (greedy algorithms will not always produce the global optimum in results).

Expected Performance

Having validated the model that is going to be used, it is always good to understand its weaknesses (so that they can be addressed). We will use a confusion matrix to see which categories the classification model has difficulty getting right.
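A sketch of that check, reusing the fitted classifier (xgb_c) from the earlier experiments; X_val and y_val are assumed hold-out splits:

```python
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

y_pred = xgb_c.predict(X_val)

# Rows = true Final Grades, columns = predicted Final Grades.
cm = confusion_matrix(y_val, y_pred)
print(cm)

ConfusionMatrixDisplay(cm, display_labels=[1, 2, 3, 4]).plot()
```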

From the confusion matrix above, it is apparent that the model is showing a bias towards classifying students in higher Grades.

We can see that the model has difficulty classifying students into their exact final grade category (e.g. for Final Grade 1 students, only 104/147 (70.7%) were correctly categorized - it is not very sensitive to the characteristics of Grade 1 students). Given the number of students that fall into Grade 1 based on the percentiles we set for the grade thresholds, there is sufficient data in terms of volume (relative to the dataset given) to characterize a Final Grade 1 student.
The poor sensitivity could simply mean that it is harder to predict students who are going to perform poorly than those who will do well (Final Grade 4 prediction has a sensitivity of 538/612, or 87.9%). It could also mean that the quality of the data collected from students belonging to Final Grade 1 is poorer (e.g. false data on the number of hours studied per week).
Among all the metrics with which to analyze the confusion matrix, sensitivity is the most relevant, as the school's top priority is to prevent students from falling through the cracks (in this case, being falsely classified as negative). To improve the sensitivity towards the characteristics of Grade 1 students, it would be good to either increase the quantity of data from Grade 1 students in the dataset or improve the quality of the data collected from these students. It would also help to collect data specific to the identification of Final Grade 1 students - perhaps something like "detentions_received". From a model-side perspective, identifying the features that help differentiate a Grade 1 student from other students (if such features exist) and assigning them a larger weight would address this issue.

If more time was available, experimenting with the exact percentiles to best split the threshold would be useful as well. For example, if the Grade 1 percentile threshold is too high, the characteristics of students who truly need help will be mixed with those who are on the borderline, or perhaps even just average. For this dataset, the threshold corresponds to those who score 48 marks and below for the final exam which is reasonable.

However, knowing that the model has a bias towards giving students a higher grade allows the user of the model to do one simple thing to take advantage of this fact: Take both Grade 1 and Grade 2 students as those who should be focused on. Based on the split above, doing so will capture 93.2% (137/147) of the students who require assistance (based on our percentile assumption), which shows good potential for a model which has not been tuned.

Predicting scores can be extremely difficult as exams are not the best environment for consistency. Even if a model has successfully identified that a student was supposed to perform well, it is possible that in the final exam the student fumbles due to stress, carelessness or inability to focus. If the model is expected to capture this information as well, it will be good to take multiple test scores and consolidate their average and variance as a proxy for performance consistency.

Conclusion

Through this EDA, we have achieved the core objectives that were aimed for:

  1. Cleaned up the data set nicely, removing rows with negative entries and duplicate entries as required
  2. Understood the characteristics of each feature using the profile report in combination with additional visualization and statistical analysis
  3. Encoded features based on the understanding of what their data represents
  4. Used logic and domain knowledge to establish relationships between features to create new features
  5. Used different imputation methods to determine which methods work better for which model
  6. Conducted PCA to statistically check for any additional relationships between features which may have been missed out
  7. Gone through every feature and removed features that do not assist in score prediction using understanding of the problem and sequential feature selectors, while identifying features that may be beneficial to one type of model but not the other

Note: Although the data was split into train, validate and test sets, no testing was done in this EDA; only cross-validation within the train set. A mini-exercise in the annex shows that the performance on the test set is similar to the performance assessed using cross-validation.

Thank you for embarking on this exploration with me :]

Annex

Continuing from the last classification example, a test is run to determine whether cross-validation is a good indicator of actual performance. The tests are run on the classification model, but the conclusions apply to the regression model as well.

We can see that xgb_c, fitted on the training set, has cross-validation scores that are very similar to its test set scores. This means that the cross-validation scores are a reasonable indicator of how the model will perform on test sets.

In this second experiment, the test set data was included during training as well. As can be seen, the performance is only slightly better, indicating that there is no overfitting of the model.

In this last experiment, the model is allowed to overfit to the data before being tested on the data it was trained on. This shows that the model works well for this problem and can indeed be fitted to the data 'perfectly', implying that if given enough (well-processed) data, it should be able to make good predictions on test performance.