One of the original aims for this project was to see whether library usage data could be combined with other variables to build a model that might help predict student outcomes. At first we thought this mightn’t be possible – student results weren’t normally distributed, and this precluded any regression analysis. But once we weeded out the part-time, non-Huddersfield-based students, we found that the results were normally distributed after all. Hurrah! This is always a happy, happy moment for statistics bods.
So, I set out with the data that I had to try to build a model. The aim, in statistics, is to build the best predictive model using the fewest input variables. This is for several reasons: it’s practical (it saves you from collecting loads of data which doesn’t add much to the final prediction), it’s elegant, and it reduces the likelihood of problems such as multicollinearity, where two or more of your input variables are correlated with each other, which stuffs up the predictive power of your model. It’s also important to use only variables which you have a strong theoretical reason for including – randomly picking variables to see which one gives you the best result is called data mining, and it’s Not Allowed. For these reasons, I used only the library usage variables which we’ve already shown to be related to outcome: e-resource hours, PDF downloads, number of resources accessed 1, 5 and 25 times, and number of items borrowed.
Regression analysis in SPSS gives you about a million different outputs, but the key one, in terms of understanding the predictive ability of the model, is the adjusted R². This figure ranges from 0 (terrible) to 1 (perfect, and highly suspicious!), with a penalty applied for each extra predictor you include. I was taught that in the social sciences an adjusted R² of around .7 is considered pretty good going. The one we achieved with this model is .106. Oh dear.
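For anyone curious about where that figure comes from, here’s a rough sketch of the calculation in Python. The data here is entirely made up – 200 imaginary students and three generic predictors standing in for the usage variables – so the numbers illustrate the formula, not our results:

```python
# Illustrative only: synthetic data, NOT the project's dataset.
# Fit ordinary least squares and compute R-squared and adjusted R-squared.
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3                            # 200 students, 3 predictors
X = rng.normal(size=(n, p))              # stand-ins for usage variables
y = 0.3 * X[:, 0] + rng.normal(size=n)   # weak signal plus lots of noise

Xd = np.column_stack([np.ones(n), X])    # add an intercept column
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
resid = y - Xd @ beta

ss_res = np.sum(resid ** 2)                      # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)             # total sum of squares
r2 = 1 - ss_res / ss_tot
# Adjusted R-squared penalises each extra predictor:
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"R2 = {r2:.3f}, adjusted R2 = {adj_r2:.3f}")
```

The adjusted figure is always a little lower than the raw R², and the gap widens as you pile in more predictors – which is exactly why it’s the one to watch when comparing models of different sizes.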
Furthermore, there do seem to be some potential problems with multicollinearity. This isn’t hugely surprising – after all, we ought to expect the three variables for number of e-resources accessed to be related, especially the last two – if you’ve accessed something 25 times you have also accessed it 5 times after all! It may also be that there’s a relationship between the hours spent logged into e-resources and the number of resources accessed.
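As a rough illustration of the kind of check involved (again with invented counts, not our data), a strong pairwise correlation between two predictors translates directly into an inflated variance inflation factor (VIF) – the usual diagnostic for multicollinearity:

```python
# Hypothetical example: resources accessed 25+ times are, by construction,
# a subset of those accessed 5+ times, so the two counts should correlate.
import numpy as np

rng = np.random.default_rng(1)
accessed_5 = rng.poisson(20, size=500).astype(float)
# Model the 25+ count as a noisy fraction of the 5+ count (an assumption
# made purely so the example exhibits the problem):
accessed_25 = accessed_5 * 0.3 + rng.normal(0, 0.5, size=500)

r = np.corrcoef(accessed_5, accessed_25)[0, 1]
print(f"Pearson r = {r:.2f}")

# For a single pair of predictors, VIF = 1 / (1 - r^2).
# A VIF above roughly 5-10 is a common rule-of-thumb warning sign.
vif = 1 / (1 - r ** 2)
print(f"VIF = {vif:.1f}")
```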
So I next tried a more parsimonious model: one which eliminated the variables for number of e-resources accessed 5 and 25 times. The adjusted R² for this model is .107 – fractionally better – and the multicollinearity seems to be less of a problem. But it’s still not a great predictor. For the next step, I added in the demographic variables that we have, and got a model with an adjusted R² of .177. This pretty much exhausted the data available to me, and I have therefore declared the answer to our original question to be – no, you can’t build a model to predict outcomes with the data available to this project.
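A toy sketch of why dropping a near-duplicate predictor barely moves the adjusted R² – echoing our .106 to .107 result. Everything here is invented: the 0.95 weighting exists only to make the two predictors nearly collinear:

```python
# Synthetic demonstration: removing a nearly-redundant predictor leaves
# adjusted R-squared essentially unchanged.
import numpy as np

def adj_r2(X, y):
    """Fit OLS with an intercept and return adjusted R-squared."""
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    ss_res = np.sum((y - Xd @ beta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

rng = np.random.default_rng(2)
n = 300
x1 = rng.normal(size=n)                        # e.g. e-resource hours
x2 = 0.95 * x1 + rng.normal(0, 0.3, size=n)    # near-duplicate predictor
y = 0.4 * x1 + rng.normal(size=n)              # noisy outcome

full = adj_r2(np.column_stack([x1, x2]), y)    # both predictors
reduced = adj_r2(x1.reshape(-1, 1), y)         # duplicate dropped

print(f"full: {full:.3f}, reduced: {reduced:.3f}")
```

The reduced model explains essentially the same variance with one fewer input – which is the whole case for parsimony.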
Now, that’s not to say that nobody could. There are a lot of variables we haven’t included. In some cases this is because we don’t have the data – for example, UCAS points might be a useful indicator of a student’s potential on arrival and a good thing to include, if you have them. In other cases the data is simply too difficult to collect – how do you measure a teacher’s ability (as opposed to how that ability is perceived by their students!), or the stressful life events a student is dealing with at exam time?
It would be really interesting to see whether integrating the missing information, such as UCAS points, would improve the model – and whether we could find a way of measuring proxies for some of those other issues: contact with pastoral staff, say, or student satisfaction surveys. Of course, this would raise a lot of ethical issues and couldn’t be done lightly. But it would be interesting…