Our first results!

There’s a little bit more methodology here – but also some results! I promise!

I have been trying to organise the data so that we have some useful variables ready for analysis. As I said in my last post, you need to know what your data actually means before you start analysing it. But sometimes you also need to manipulate it before you start, because the variables you've been given aren't quite right for the analysis you want to do.

One of the aims of the project is to understand the characteristics of people who use the library in a bit more detail. Now that we know there's a relationship between usage and attainment, we need to know whether any groups use the library less than others, so that interventions can be targeted towards them. We're going to look at a number of factors, including gender (Dave's already posted some early graphs on this), ethnicity, country of origin, age (again, Dave's got some graphs) and discipline.

Now, to return to my point about manipulating the data – we have some very detailed information about these characteristics for students at Huddersfield. In fact, in some categories there are only half a dozen students. If we were to use this data, it might be possible for a reader to identify individual students, which is a BIG data protection no-no.

So instead, we need to aggregate up. And that raises all sorts of questions. What are the best groupings we can arrive at in order to (a) protect the students and (b) remain theoretically consistent? It might be, for example, that by aggregating the Chinese and the Black – Other categories we could reach a number which would guarantee students can't be identified – but do we really think those groups have anything in common with one another? Probably not. So it doesn't make a lot of sense to bung them in together when trying to understand their library usage.
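As a rough sketch of what that aggregation step looks like in code – the category names, student counts and minimum cell size below are all invented for illustration, not our actual data or rules:

```python
# Hypothetical sketch: categories, counts and the threshold of 10 are
# invented; the real grouping decisions are made on theoretical grounds.
MIN_CELL_SIZE = 10  # below this, a published cell risks identifying students

# invented counts of students per ethnicity category
raw_counts = {
    "Chinese": 48,
    "Black - Other": 6,
    "Black - African": 31,
    "Black - Caribbean": 9,
}

# the chosen aggregation: only merge categories that are theoretically
# coherent, not just whichever pair happens to clear the threshold
grouping = {
    "Chinese": "Chinese",
    "Black - Other": "Black",
    "Black - African": "Black",
    "Black - Caribbean": "Black",
}

def aggregate(counts, grouping):
    """Sum the raw counts into their assigned aggregate groups."""
    out = {}
    for category, n in counts.items():
        group = grouping[category]
        out[group] = out.get(group, 0) + n
    return out

agg = aggregate(raw_counts, grouping)
# every published cell should now clear the minimum size
too_small = [g for g, n in agg.items() if n < MIN_CELL_SIZE]
print(agg)        # {'Chinese': 48, 'Black': 46}
print(too_small)  # []
```

The point of keeping the grouping as an explicit mapping is that it documents the judgment calls: anyone can see exactly which original categories went into each aggregate.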

Here are some early results, looking at the total number of hours in which a student was logged into e-resources (via EZProxy), and the total number of PDFs that students downloaded, using the aggregates that I've chosen. First, the ethnicity ones:

These data are aggregated – each category, apart from Chinese, contains at least two different groups from the original data. If you look at the graphs showing the original datasets, which I can't share here because some of the groups are too small, you can see that there are considerable differences between groups that we've aggregated. So that's something we'll need to bear in mind – differences between the groups within our aggregates might be so big as to require us to rethink our groupings. If they don't all look similar, perhaps it's not helpful to group them together?

And here’s the same kind of graph, looking at country of domicile.

Again, that's been aggregated up, in this case by Huddersfield in the original data that was sent to me. I think I'm going to have a play to try and find a happy medium between their highly-aggregated data and the very detailed country-level data, which could reveal individual students' identities. I'm particularly interested in what's going on with those high downloads in the European Community data. I don't think it's just an outlier, as there's a reasonable sample, so one or two high usage levels shouldn't skew the overall picture. So is there a big difference between, say, Balkan states and older EU members?
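One quick sanity check for whether a high figure is outlier-driven is to compare the mean with the median: a couple of extreme users drag the mean far above the median, whereas broadly high usage keeps the two close. The download counts below are invented purely to illustrate the contrast.

```python
# Illustrative only: these are invented download counts, not project data.
import statistics

# a group where usage is genuinely high across the board
broad_high = [120, 95, 140, 110, 88, 132, 105, 99, 150, 117]

# a group whose average is inflated by a single extreme user
outlier_driven = [10, 12, 9, 11, 14, 8, 13, 10, 12, 900]

# mean and median sit close together when no outlier is at work
print(statistics.mean(broad_high), statistics.median(broad_high))        # 115.6 113.5

# ...and far apart when one extreme value dominates the mean
print(statistics.mean(outlier_driven), statistics.median(outlier_driven))  # 99.9 11.5
```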

I suspect that some librarians might be a bit uncomfortable with the idea of using such personal data about students. Ethnicity in particular can be a very sensitive issue, and we will have to tread carefully when we talk about differences in usage. But the point to remember is that all of this work is done with the aim of improving services to students. The data is completely anonymous: I have nothing to link the data points back to an individual student, as the student numbers were removed before the data was sent to me. If we can use it to understand what students are doing and how it's affecting their outcomes, while remaining mindful of our need to protect student confidentiality, I think it can only be a good thing.

The next steps will be for us to test whether these differences are statistically significant – that is to say, whether they are a fluke that's happened to occur in our particular sample, or whether we can say with statistical confidence that they exist in the wider population as well. For that, we go beyond the familiar frustrations of Excel to the brave new world of… SPSS! Wish me luck…
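The logic behind that "is it a fluke?" question can be illustrated with a simple permutation test: if group labels were meaningless, randomly reshuffling them should produce gaps as big as the one we observed fairly often. This is just a sketch of the idea in plain Python – the real analysis will be done in SPSS, and the usage figures below are invented.

```python
# Illustration of the idea behind a significance test, using a permutation
# test. All numbers are invented; the project's real tests will run in SPSS.
import random

# hours logged into e-resources for two invented groups
group_a = [5, 8, 12, 7, 9, 11, 6, 10]
group_b = [15, 18, 14, 20, 17, 16, 19, 13]

observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))

# if the labels didn't matter, a random relabelling should often produce
# a gap at least this big; count how often that actually happens
random.seed(42)
pooled = group_a + group_b
trials = 10_000
n_extreme = 0
for _ in range(trials):
    random.shuffle(pooled)
    a, b = pooled[:len(group_a)], pooled[len(group_a):]
    gap = abs(sum(a) / len(a) - sum(b) / len(b))
    if gap >= observed:
        n_extreme += 1

p_value = n_extreme / trials
print(f"observed gap: {observed:.1f} hours, p ≈ {p_value:.4f}")
```

A small p-value means almost no random relabelling reproduces a gap that large, so the observed difference is unlikely to be a sampling fluke – which is exactly what SPSS's significance tests will tell us more formally.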