I read an interesting article today about how the International Baccalaureate (IB) program for high school education decided to cancel end-of-year exams during the COVID-19 pandemic and use a statistical model to predict final grades instead. As an online chemistry tutor for high school students in different programs, IB included, as well as a data scientist, I am a little shocked at how little transparency there was in the process used to assign final grades to students. It has clearly caused confusion and distress, and has led to college and university offers being rescinded, which surely was not the intended outcome. This got me wondering: what kind of questions did the IB administrators and the unnamed company contracted to build the statistical model ask before releasing the final grades? Below, I outline at least some of the discussions I would make sure to have if I were working on this team to solve this challenging problem.
Quality of data: what are the train, validation, and test sets? It seems like the current students should be the test set, but we should confirm that the features for this cohort come from the same distribution as those in the train and validation sets built from past years' records; one way to check this is sketched below.
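Here is a minimal sketch of such a check: a two-sample Kolmogorov-Smirnov test on a single feature, using made-up numbers and a hypothetical column name (coursework_avg) in place of the real records.

```python
# Minimal sketch: has a feature's distribution shifted between past cohorts
# (train/validation) and the current cohort (test)? Data and column names
# here are hypothetical stand-ins.
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Stand-in data: coursework averages for past years vs. the current year.
past_years = pd.DataFrame({"coursework_avg": rng.normal(5.0, 1.0, 2000)})
current_year = pd.DataFrame({"coursework_avg": rng.normal(5.2, 1.1, 500)})

for col in ["coursework_avg"]:
    stat, p_value = ks_2samp(past_years[col], current_year[col])
    print(f"{col}: KS statistic={stat:.3f}, p-value={p_value:.3g}")
    # A small p-value suggests the feature distribution differs between
    # cohorts, so a model trained on past years may not transfer cleanly.
```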
Model selection: what kind of methodology should we use to solve this problem: a huge neural network with many layers, or a simpler regression that is more interpretable? We should also perform hypothesis testing to measure the efficacy of the chosen model against previous years' results, when final exams were actually administered; a rough sketch of such a comparison follows.
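As an illustration only, here is how one could compare an interpretable linear model against a more flexible boosted model over the same cross-validation folds, then ask whether the difference in error is statistically meaningful. The data and the two models are stand-ins, not what the IB's contractor actually used.

```python
# Minimal sketch on synthetic data: interpretable linear model vs. a more
# flexible model, compared on identical folds with a paired t-test.
import numpy as np
from scipy.stats import ttest_rel
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

linear_scores = cross_val_score(Ridge(), X, y, cv=cv,
                                scoring="neg_mean_absolute_error")
boosted_scores = cross_val_score(GradientBoostingRegressor(random_state=0), X, y,
                                 cv=cv, scoring="neg_mean_absolute_error")

# Paired test over the same folds: is the extra complexity buying accuracy?
stat, p_value = ttest_rel(linear_scores, boosted_scores)
print("linear MAE:", -linear_scores.mean(), "boosted MAE:", -boosted_scores.mean())
print(f"paired t-test p-value: {p_value:.3g}")
```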
Feature engineering and preprocessing: what input features were used to build the model? The article claims historical data from the school as well as a student's performance through the year, but how correlated are a school's results from year to year? Using the overall school performance history does not fully make sense to me, so I would want to train a model on individual students' records from past years to predict their final grades. For schools without historical data, can we selectively pool information from schools chosen on some criteria (location, resources, past student performance before final exams) rather than simply use all schools offering that subject? A sketch of that pooling idea follows.
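Here is a minimal sketch of the pooling idea, with entirely hypothetical school features (region_code, resources_index, past_cohort_mean): find the most similar schools and borrow their records as training data.

```python
# Minimal sketch: for a school with no history in a subject, borrow training
# records only from schools that look similar on a few criteria.
# All column names and values are hypothetical.
import pandas as pd
from sklearn.neighbors import NearestNeighbors

schools = pd.DataFrame({
    "school_id":        [1, 2, 3, 4, 5],
    "region_code":      [0, 0, 1, 1, 0],
    "resources_index":  [0.8, 0.7, 0.3, 0.4, 0.75],
    "past_cohort_mean": [5.5, 5.3, 4.1, 4.3, 5.4],
})

target = schools[schools.school_id == 5]       # school lacking subject history
candidates = schools[schools.school_id != 5]

features = ["region_code", "resources_index", "past_cohort_mean"]
nn = NearestNeighbors(n_neighbors=2).fit(candidates[features])
_, idx = nn.kneighbors(target[features])       # indices of the closest peers

peer_ids = candidates.iloc[idx[0]]["school_id"].tolist()
print("pool training data from peer schools:", peer_ids)
```

In practice the similarity features would need scaling and far more thought than this toy example, but the point is that the pooling criterion is an explicit modeling choice, not an afterthought.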
Metrics: how is success defined? We want to avoid extreme mispredictions, so a cost function that is robust to outliers (one that tolerates large individual errors) is not useful here; a loss that penalizes large errors heavily makes more sense. That way, grade predictions stay close to the actual values for the training data and, hopefully, the test data. The appendix of the IB's announcement letter claims that from 2010 to 2015, 55% of students received final grades different from the ones their teachers predicted. The model needs to beat that baseline to be worth using; a sketch of metrics that capture this is below.
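A small sketch on made-up grades: alongside the average error, track the worst single-student error and the exact-match rate, since the exact-match rate is what the 55% figure speaks to.

```python
# Minimal sketch of candidate evaluation metrics on hypothetical predictions.
# The goal is to surface large individual errors and exact-grade agreement,
# not just an average that can hide extreme mispredictions.
import numpy as np

actual    = np.array([7, 6, 5, 4, 6, 3, 5, 7])   # true final grades (IB 1-7 scale)
predicted = np.array([7, 5, 5, 6, 6, 3, 4, 7])   # model's predicted grades

abs_err = np.abs(actual - predicted)
print("MAE:", abs_err.mean())                     # average error, tolerant of outliers
print("RMSE:", np.sqrt((abs_err ** 2).mean()))    # squaring penalizes large errors more
print("max error:", abs_err.max())                # worst single-student outcome
print("exact-match rate:", (abs_err == 0).mean()) # compare with the ~45% teacher baseline
```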
Bias/variance tradeoff: how can we ensure we are not overfitting to the training data? If there are not enough training examples, or the train and test distributions overlap poorly, the algorithm can end up with a complicated model that is too tuned to the specifics of this dataset. That leads to higher generalization error on the test set, and therefore to bad grade predictions for the current students. One way to watch for this is sketched below.
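On synthetic stand-in data, with a decision tree standing in for the real model, one can compare training and validation error as model complexity grows and look for the point where the two diverge.

```python
# Minimal sketch: training vs. validation error as model complexity increases,
# to spot where the model starts memorizing instead of generalizing.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=8, noise=15.0, random_state=0)

depths = [2, 4, 6, 8, 12, 16]
train_scores, valid_scores = validation_curve(
    DecisionTreeRegressor(random_state=0), X, y,
    param_name="max_depth", param_range=depths,
    cv=5, scoring="neg_mean_absolute_error",
)

for d, tr, va in zip(depths, -train_scores.mean(axis=1), -valid_scores.mean(axis=1)):
    # A widening gap between training and validation error signals overfitting.
    print(f"max_depth={d:2d}  train MAE={tr:5.2f}  valid MAE={va:5.2f}")
```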
Cross-validation: we should perform cross-validation to check model validity. It provides a better estimate of the performance we can expect from the model than a single split does; see the sketch below.
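For instance, on synthetic stand-in data, reporting the spread of fold scores rather than a single number:

```python
# Minimal sketch: k-fold cross-validation, reporting mean and spread of the
# error estimate instead of relying on one train/test split.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=800, n_features=10, noise=12.0, random_state=0)

scores = cross_val_score(Ridge(), X, y, cv=10, scoring="neg_mean_absolute_error")
print(f"MAE over 10 folds: {-scores.mean():.2f} +/- {scores.std():.2f}")
```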
Learning curve: we should plot a learning curve to see how validation error varies with the amount of training data used. This lets us check whether the model has enough representative data for making predictions, and makes comparisons with other models easier; a sketch follows.
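A sketch on synthetic data, using scikit-learn's learning_curve helper:

```python
# Minimal sketch: how validation error changes as more training examples are used.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import learning_curve

X, y = make_regression(n_samples=1200, n_features=10, noise=12.0, random_state=0)

sizes, train_scores, valid_scores = learning_curve(
    GradientBoostingRegressor(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5, scoring="neg_mean_absolute_error",
)

for n, va in zip(sizes, -valid_scores.mean(axis=1)):
    # If validation error is still dropping at the largest size, more data would
    # help; if it has flattened, collecting more of the same data likely won't.
    print(f"{n:5d} training examples -> validation MAE {va:5.2f}")
```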
If I had access to the dataset, this is how I would proceed before deciding on the final model to deploy. Hopefully the team contracted by the IB program to build its statistical model also worked through these issues before settling on the model used to assign student grades. It is a challenging problem, especially because it has real-world consequences, and careful thought and experimental design need to go into algorithms that can change the course of students' lives.