Final blog post: managing the data

In terms of the data, this project was both simpler and more complex than the first stage. The fact that we were only dealing with data from a single institution made a big difference: we didn’t have to worry about ensuring that our classifications of data matched across different institutional systems and we only had one set of people to talk to when the data wasn’t quite what we were expecting!

On the other hand, the data itself was considerably more complicated. We had a larger number of variables to manage, and in many cases the data didn’t present itself in the ideal format for analysis. So a lot of the work on this round of the project was taken up with recoding data – and that involves not just hours of shouting at Excel but also a bit of brainpower to decide how to create groups and sub-groups within different variables.

Let’s take, as an example, the ‘ethnicity’ variable. The data provided by Huddersfield divided students into 20 different ethnicity categories, some of which had only a handful of students in them. This was problematic for two reasons – first, the students might potentially be identifiable in such small groups, and second, it was very unlikely that statistical tests would reveal any meaningful differences at that level of detail. By grouping the 20 categories into 5 or 6, we protected student identities and increased the viability of our statistical tests.
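
As a rough illustration of the first problem, here is a minimal Python sketch of how you might spot the categories that are too small to report safely. The project itself worked in Excel; the threshold of 10, the function name and the records below are all made up for the example and are not project data.

```python
from collections import Counter

# Hypothetical threshold: groups smaller than this are treated as too small to report.
MIN_CELL_SIZE = 10


def small_categories(labels, min_size=MIN_CELL_SIZE):
    """Return each category whose student count falls below min_size."""
    counts = Counter(labels)
    return {category: n for category, n in counts.items() if n < min_size}


# Made-up records for illustration only.
records = (
    ["White"] * 40
    + ["Chinese"] * 12
    + ["Black – Other"] * 3
    + ["Asian – Other"] * 2
)

# Categories listed here are candidates for merging into a broader group.
print(small_categories(records))
```

A check like this only tells you which groups are too small; deciding what to merge them into is the part that needs a theoretical justification.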

Now, it’s important to stress that you must not simply regroup data in order to get a statistically significant result, trying lots of different groupings until you get one that gives you the numbers you want. We had to have some theoretical underpinning to our categorisation – and this is where the ‘brainpower’ I mentioned earlier comes in. We could have categorised everyone into ‘white’ and ‘non-white’ but this wouldn’t necessarily give us particularly useful information: it would not reveal any differences between non-white groups, which might be important when planning activities to support those various groups. So we were looking for a happy medium somewhere between 20 and 2 groups.

Luckily, this is a problem that other researchers have faced before, and we were able to borrow from some fairly standard categories (you can see them in more detail in this post). Even then, though, we had to scratch our heads a bit to map the Huddersfield categories onto the new ones. For example, we had several groups called things like ‘Black – Other’ and ‘Asian – Other’. Should we put these into one big ‘other’ category, or into ‘Black’ and ‘Asian’ respectively? In the end, we felt that we didn’t have enough information to put them into the more specific categories. For example, our ‘Asian’ category referred specifically to the Indian subcontinent, and we grouped China separately. But we didn’t know whether our ‘Asian – Other’ students were of Indian or Chinese heritage, or of a heritage which didn’t fit into our existing groups – Korean, or Thai, for example.
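
One way to keep this kind of decision visible is to write the mapping down explicitly, so that every detailed label has exactly one broad group and the reasoning sits next to it. The Python sketch below assumes a handful of labels: ‘Asian – Indian’ and the broad group names are invented for the example, and this is not the full Huddersfield list or the exact scheme we used.

```python
# Illustrative mapping only: a few example labels, not the real category list.
BROAD_GROUP = {
    "Asian – Indian": "Asian",    # hypothetical label for the Indian subcontinent
    "Chinese": "Chinese",         # grouped separately from 'Asian'
    "Black – African": "Black",
    "Black – Other": "Other",     # not enough information to place in 'Black'
    "Asian – Other": "Other",     # could be Indian, Chinese, Korean, Thai...
}


def recode(label, mapping=BROAD_GROUP):
    """Map a detailed label to its broad group; fail loudly if it is unmapped."""
    try:
        return mapping[label]
    except KeyError:
        raise ValueError(f"No broad group defined for {label!r}") from None


print(recode("Asian – Other"))
```

Raising an error for an unmapped label, rather than quietly dropping it into ‘Other’, means a new or misspelled category gets a deliberate decision instead of an accidental one.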

In some cases, we didn’t have standard categories to borrow from, and had to create our own. For example, our data held only the names of courses that students were following, not their discipline or school. So when trying to group by subject, we had to start from scratch, taking the 100-odd courses and grouping them to allow analysis. We did this with the help of the library staff: it’s fairly likely that if we’d used other Huddersfield staff members we would have ended up with slightly different classifications. In the same way, other institutions might have a different take on how best to organise such data, according to their own organisational set-ups.

All in all, arranging and managing the data was one of the more time-consuming elements of the project, and it is only likely to get trickier if the project goes on to work with data from several institutions. The difficulty begins with coding the text-based data: for example, imagine that one institution records students as ‘Black – African’, another as ‘African (Black)’ and another as ‘Black African’. It’s not too difficult to get Excel to recognise that these are all the same thing, but it would be very time-consuming to do it for each response across a number of variables. And for some variables it might involve some intellectual effort to decide upon equivalence – discipline is a particularly good example of this. Each university will offer slightly different courses, and it will take work to map them onto each other.
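
To give a feel for the simpler, mechanical end of that matching, here is a small Python sketch (rather than Excel) that treats two labels as equivalent when they contain the same words, ignoring case, punctuation and word order. The function name is hypothetical, and the approach is deliberately crude: it would not help with labels that use different words for the same group, and it could wrongly merge labels that share words but mean different things, so a person would still need to check the results.

```python
import re


def label_key(label):
    """Reduce a label to an order-independent set of lowercase words."""
    return frozenset(re.findall(r"[a-z]+", label.lower()))


# The three spellings from the example above.
variants = ["Black – African", "African (Black)", "Black African"]

# If the labels are equivalent under this rule, they all share one key.
keys = {label_key(v) for v in variants}
print(len(keys) == 1)
```

This only covers labels that differ in formatting. Deciding whether two differently named courses belong to the same discipline is a judgement call that no amount of string matching will make for you.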

After the variables have been coded and brought into alignment, another challenge exists in ensuring that analysis works for all partners. Take country of domicile as an example. Some universities may have a particularly large number of students from, say, the Middle East, and want to separate that out as an analytic category. But, given that we need to keep the overall numbers of analytic categories low in order for the analysis to work and that different universities are likely to have different areas of geographic interest, how do we decide on the best groupings?

We’ll have to have a think about how best to address these challenges if the research is to progress to include a larger number of institutions who want to work together and pool their data.