Hello.

Hello! You’ve changed the format of the blogs, haven’t you? Yes, we like to mix it up a bit. We thought this might be fun, and not at all derivative.

What are these bonus findings then? Well, it all stems from the fact that this time round we have been given students’ final grades as a percentage, rather than a class. Continuous rather than categorical data. This opens up a whole new world of possibilities in terms of identifying a relationship between usage and grades.

Wait, didn’t you already prove that in Phase 1? We certainly did.

So why are you doing it again? Well, to not-quite-quote a famous mountaineer – because we can. It’s important to be clear that we’re not trying to ‘prove’ or ‘disprove’ results from the previous phase. Those stand alone. We’re simply taking advantage of the possibilities offered by the new data.

And those possibilities are…? Remember Spearman’s correlation coefficient from the last post? Well, we can use that again. As you’ll remember from earlier posts, it’s best to keep continuous data continuous if you can. The first round of the project gave librarians with percentage grades – continuous data – a methodology which required them to convert said grades into classes – categorical data. So we’re outlining this technique for their benefit – it’ll save time AND it’s better!

But if you’ve only got the class-based data? Not a problem! Use the old technique, which is designed for class-based data. This is just about giving people options so that they can choose whatever fits their data best.

Right. Got it. So, what did you find? This might be where you have to take a bit of a back seat, my inquisitive friend.

In fact, we found absolutely nothing to surprise us. The findings echo everything we established in the first phase, and the additional work we’ve done with extra variables in this phase. Figure 1 shows the effect sizes and significance levels for each variable.

As usual, I’ve only reported the statistically significant ones, and they are exactly the same as the ones that were statistically significant in our previous tests. You can see that, again, we’ve found a slight negative correlation between the percentage of e-resource use which happens overnight and the final grade. Once again, I’m inclined to dismiss this as a funny fluke within the data, rather than an indication that overnight usage will harm your grade.

So nothing new to report? Not really. Just a new method (outlined in the toolkit) for those librarians who want to take advantage of their continuous datasets.

Good news! We have acquired some extra data from various very helpful departments at Huddersfield, which allows us to explore a few additional angles for both indicators of usage and the relationship between usage and outcomes. The new data relates to the UCAS points of the students within the study, and to the percentage of each student’s total e-resource usage that occurs on campus.

Let’s start with the UCAS points. We’ve treated these as a kind of ‘demographic’ characteristic and, as with the ones we’ve already posted about, we wanted to see whether there was a correlation between them and the library usage variables. Because both the UCAS points and the usage data are continuous variables, but not normally distributed, we used a slightly different measure than in our previous work – Spearman’s correlation coefficient. Figure 1 shows the findings.
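For anyone who wants to try this outside a statistics package, here is a minimal sketch of Spearman’s coefficient in plain Python. It ranks each variable (tied values share the average of their ranks) and then takes the ordinary Pearson correlation of the ranks. The function names and the six made-up students are assumptions for illustration only; they are not project data, and this is not the procedure the project itself ran.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranked data."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up values for six students, for illustration only.
ucas_points = [240, 300, 280, 360, 320, 260]
items_borrowed = [3, 10, 12, 40, 15, 5]
rho = spearman(ucas_points, items_borrowed)
```

Because only the ranks matter, the coefficient does not need either variable to be normally distributed. This sketch gives the effect size only; a significance test would still be needed alongside it.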

I’ve only included effect sizes for the ones that were statistically significant and, as you can see, there aren’t many of them. And even for those that were, the effect sizes were pretty small. This is an interesting finding. It suggests that the relationship we saw between usage and outcomes in phase 1 of the project is not simply down to high-achieving students being more likely to use the library – although we’d have to run further tests to check whether this is in fact the case, of course.

Next, we moved on to look at the percentage of e-resource usage which occurred on Huddersfield’s campuses (as opposed to offsite – for example, at home or in a coffee shop). Again, as with the earlier measures of usage, we wanted to see whether this varied with some of the demographic variables. Figure 2 shows our findings: I’ve only included the ones with statistically significant variations.

Younger students and men spend proportionally more time than mature students and women accessing e-resources on campus (as compared to other locations). The same is true for Asian and black students, compared to white students; for computing and engineering students compared to social scientists; and for students based in the UK compared to those based in China. Remember, these are just the ones with statistically significant differences, so in fact there are not that many – and they are all small effect sizes.

Finally – and perhaps most interestingly – the relationship between percentage of usage of e-resources on campus and final degree results. Figure 3 shows the findings from this analysis; again, only the statistically significant results are shown.

You can see that, although the effect sizes are small, there are differences in the percentage of on-campus usage between students who go on to achieve a First and those who achieve a 2.ii or Third, and between those who achieve a 2.i and those who achieve a 2.ii. In each case, the students who go on to achieve higher grades are the ones who have had a lower percentage of on-campus e-resource usage. This suggests that there is some relationship between reading electronic content in locations other than the university and doing well in your degree.

In terms of the data, this project was both simpler and more complex than the first stage. The fact that we were only dealing with data from a single institution made a big difference: we didn’t have to worry about ensuring that our classifications of data matched across different institutional systems and we only had one set of people to talk to when the data wasn’t quite what we were expecting!

On the other hand, the data itself was considerably more complicated. We had a larger number of variables to manage, and in many cases the data didn’t present itself in the ideal format for analysis. So a lot of the work on this round of the project was taken up with recoding data – and that involves not just hours of shouting at Excel but also a bit of brainpower to decide how to create groups and sub-groups within different variables.

Let’s take, as an example, the ‘ethnicity’ variable. The data provided by Huddersfield divided students into 20 different ethnicity categories, some of which had only a handful of students in them. This was problematic for two reasons – first, the students might potentially be identifiable in such small groups, and second, it was very unlikely that statistical tests would reveal any meaningful differences at that level of detail. By grouping the 20 categories into 5 or 6, we protected student identities and increased the viability of our statistical tests.

Now, it’s important to stress that you must not simply regroup data in order to get a statistically significant result, trying lots of different groupings until you get one that gives you the numbers you want. We had to have some theoretical underpinning to our categorisation – and this is where the ‘brainpower’ I mentioned earlier comes in. We could have categorised everyone into ‘white’ and ‘non-white’ but this wouldn’t necessarily give us particularly useful information: it would not reveal any differences between non-white groups, which might be important when planning activities to support those various groups. So we were looking for a happy medium somewhere between 20 and 2 groups.

Luckily, this is a problem that other researchers have faced before, and we were able to borrow from some fairly standard categories (you can see them in more detail in this post). Even then, though, we had to scratch our heads a bit to map the Huddersfield categories onto the new ones. For example, we had several groups called things like ‘Black – Other’ and ‘Asian – Other’. Should we put these into one big ‘other’ category, or into ‘Black’ and ‘Asian’ respectively? In the end, we felt that we didn’t have enough information to put them into the more specific categories. For example, our ‘Asian’ category referred specifically to the Indian subcontinent, and we grouped China separately. But we didn’t know whether our ‘Asian – other’ students were of Indian or Chinese heritage, or whether they were from one which didn’t fit into our existing groups – Korean, or Thai, for example.
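To make the recoding step concrete, here is a minimal sketch in Python. The category labels, the target groups and the minimum group size are all hypothetical; they are not the project’s real scheme, which is described in the post linked above. The point is simply that the mapping is written down once, before any tests are run, and that ambiguous labels such as ‘Asian – Other’ go into ‘Other’ rather than being guessed into a more specific group.

```python
from collections import Counter

# Hypothetical source labels and target groups, for illustration only.
GROUP_FOR_CATEGORY = {
    "White - British": "White",
    "White - Irish": "White",
    "Asian - Indian": "Asian",
    "Asian - Pakistani": "Asian",
    "Chinese": "Chinese",
    "Black - African": "Black",
    "Black - Caribbean": "Black",
    # Ambiguous labels go to "Other" rather than to a specific group.
    "Asian - Other": "Other",
    "Black - Other": "Other",
}


def recode(categories, mapping, fallback="Other"):
    """Map each detailed category to its broader group."""
    return [mapping.get(category, fallback) for category in categories]


def small_groups(groups, minimum):
    """Return the groups with fewer than `minimum` students, sorted by name."""
    counts = Counter(groups)
    return sorted(group for group, n in counts.items() if n < minimum)


recoded = recode(
    ["Asian - Other", "Chinese", "White - Irish", "Unknown label"],
    GROUP_FOR_CATEGORY,
)
too_small = small_groups(["White"] * 12 + ["Asian"] * 6 + ["Other"] * 2, minimum=5)
```

Checking group sizes after recoding is a quick way to spot categories that are still small enough to raise the identifiability and viability concerns mentioned earlier.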

In some cases, we didn’t have standard categories to borrow from, and had to create our own. For example, our data held only the names of courses that students were following, not their discipline or school. So when trying to group by subject, we had to start from scratch, taking the 100-odd courses and grouping them to allow analysis. We did this with the help of the library staff: it’s fairly likely that if we’d used other Huddersfield staff members we would have ended up with slightly different classifications. In the same way, other institutions might have a different take on how best to organise such data, according to their own organisational set-ups.

All in all, arranging and managing the data was one of the more time-consuming elements of the project. This would only get trickier if the project were working with data from several institutions. The difficulty begins with coding the text-based data: for example, imagine that one institution records students as ‘Black – African’, another, as ‘African (Black)’ and another as ‘Black African’. It’s not too difficult to get Excel to recognise that these are all the same thing, but it would be very time consuming to do it for each response in a number of variables. And for some variables it might involve some intellectual effort to decide upon equivalence – discipline is a particularly good example of this. Each university will offer slightly different courses, and it will take work to map them onto each other.
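As a rough illustration of the text-matching part of this problem, the sketch below reduces each label to its lower-cased words in alphabetical order, so that the three spellings in the example above end up identical. The function name is made up, and the approach is an assumption rather than anything the project used. It would also wrongly merge any two labels that contain the same words but mean different things, so the output would still need checking by a human.

```python
import re


def normalise_label(label):
    """Reduce a label to a sorted tuple of its lower-case words."""
    words = re.findall(r"[a-z]+", label.lower())
    return tuple(sorted(words))


variants = ["Black – African", "African (Black)", "Black African"]
keys = {normalise_label(v) for v in variants}
```

With the three variants above, `keys` contains a single entry, which can then be looked up in one shared mapping instead of three institution-specific ones.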

After the variables have been coded and brought into alignment, another challenge exists in ensuring that analysis works for all partners. Take country of domicile as an example. Some universities may have a particularly large number of students from, say, the Middle East, and want to separate that out as an analytic category. But, given that we need to keep the overall numbers of analytic categories low in order for the analysis to work and that different universities are likely to have different areas of geographic interest, how do we decide on the best groupings?

We’ll have to have a think about how best to address these challenges, if the research is to progress to include a wider number of institutions who want to work together and pool their data.

One of the original aims for this project was to see whether library usage data could be combined with other variables to build a model that might help predict student outcomes. At first we thought this mightn’t be possible – student results weren’t normally distributed, and this precluded any regression analysis. But, once we weeded out the part-time, non-Huddersfield-based students, we found that the results were normally distributed. Hurrah! This is always a happy, happy moment for statistics bods.

So, I set out with the data that I had to try and build a model. The aim, in statistics, is to try and build the best predictive model using the fewest input variables. This is for several reasons: it’s practical (saves you from having to collect loads of data which doesn’t add much to the final prediction), it’s elegant, and it reduces the likelihood of problems such as multicollinearity, where two or more of your variables are correlated with each other, which stuffs up the predictive power of your model. It’s also important that you only use variables which you have a strong theoretical reason for including – randomly picking variables to see which one gives you the best result is called data mining, and it’s Not Allowed. For these reasons, I only used the library usage variables which we’ve already shown to be related to outcome: e-resource hours, PDF downloads, number of resources accessed 1, 5 and 25 times, and number of items borrowed.

Regression analysis in SPSS gives you about a million different outputs, but the key one, in terms of understanding the predictive ability of the model, is the adjusted R². This figure usually ranges from 0 (terrible) to 1 (perfect, and highly suspicious!). I was taught that in the social sciences an adjusted R² of around .7 is considered pretty good going. The one we achieved with this model is .106. Oh dear.
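For reference, the standard textbook adjustment takes the plain R² and penalises it for the number of predictors relative to the number of observations. The sketch below is a plain-Python version of that formula. The function name and the example numbers are assumptions for illustration and are not the project’s figures.

```python
def adjusted_r_squared(r_squared, n_observations, n_predictors):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    degrees_of_freedom = n_observations - n_predictors - 1
    if degrees_of_freedom <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r_squared) * (n_observations - 1) / degrees_of_freedom


# Made-up example: R-squared of 0.5 from 101 students and 6 predictors.
example = adjusted_r_squared(0.5, n_observations=101, n_predictors=6)

# With a weak model and few observations the adjustment can dip below zero.
weak = adjusted_r_squared(0.01, n_observations=21, n_predictors=5)
```

The penalty is why adding a variable that explains almost nothing can leave the adjusted figure unchanged or even lower it.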

Furthermore, there do seem to be some potential problems with multicollinearity. This isn’t hugely surprising – after all, we ought to expect the three variables for number of e-resources accessed to be related, especially the last two – if you’ve accessed something 25 times you have also accessed it 5 times after all! It may also be that there’s a relationship between the hours spent logged into e-resources and the number of resources accessed.
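One simple way to see the overlap between two predictors is to correlate them with each other and convert that into a variance inflation factor (VIF). The sketch below does this for the two-predictor case, where the VIF is 1 / (1 − r²). The data is made up, the function names are hypothetical, and this is not necessarily the diagnostic that SPSS reported for the project.

```python
def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def vif_two_predictors(x, y):
    """Variance inflation factor when a model has exactly two predictors."""
    r = pearson(x, y)
    return 1 / (1 - r ** 2)


# Made-up counts for six students. Every resource accessed 25+ times has
# also been accessed 5+ times, so the second count can never exceed the first.
accessed_5_plus = [0, 2, 3, 5, 8, 10]
accessed_25_plus = [0, 0, 1, 2, 4, 5]

r = pearson(accessed_5_plus, accessed_25_plus)
vif = vif_two_predictors(accessed_5_plus, accessed_25_plus)
```

A VIF close to 1 means the two predictors carry largely separate information; the further it climbs above 1, the more they overlap. With more than two predictors, each variable would need to be regressed on all of the others, which this sketch does not attempt.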

So I next tried a more parsimonious model: one which eliminated the variables for number of e-resources accessed 5 and 25 times. The adjusted R² for this model is .107 – fractionally better – and the multicollinearity seems to be less of a problem. But it’s still not a great predictor. For the next step, I added in the demographic variables that we have, and got a model with an adjusted R² of .177. This pretty much exhausted all the data that I had available to me, and I have therefore declared the answer to our original question to be – no, you can’t build a model to predict outcomes with the data available to this project.

Now, that’s not to say that nobody could. There are a lot of variables we haven’t included. In some cases this is because we don’t have the data – for example, UCAS points might be a useful indicator of a student’s potential on arrival and a good thing to include, if you have them. Others are because the data is simply too difficult to collect – how do you measure a teacher’s ability (as opposed to how that ability is perceived by their students!), or stressful life events that a student is dealing with at exam time?

It would be really interesting to see whether integrating the missing information such as UCAS points would improve the model – and whether we could try and find a way of measuring proxies for some of those other issues: contact with pastoral staff, student satisfaction surveys. Of course, this would raise a lot of ethical issues and couldn’t be done lightly. But it would be interesting…

In this round of analysis, we had some new metrics of library usage which weren’t available to researchers in the first round of the project. I’ve already blogged about one of them – overnight usage – and there’s more on that later in this post, but first I’d like to talk about the other three. These are – the number of e-resources accessed (as distinct from the hours spent logged into e-resources); the number of e-resources accessed 5 or more times, and the number of e-resources accessed 25 or more times. These metrics show how many of Huddersfield’s 240-odd e-resources, which range from large journal platforms and databases down to individual journal subscriptions, a student has logged into during the year, once, at least five times or at least 25 times.
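As a small illustration of how these three metrics relate to each other, the sketch below counts them from one student’s login totals per e-resource. The resource names, the login counts and the function name are all made up for the example.

```python
def resources_accessed(logins_per_resource, thresholds=(1, 5, 25)):
    """Count the resources logged into at least `t` times, for each threshold t."""
    return {
        t: sum(1 for logins in logins_per_resource.values() if logins >= t)
        for t in thresholds
    }


# Hypothetical login counts for one student over the year.
student_logins = {
    "Journal platform A": 30,
    "Database B": 7,
    "Journal C": 1,
    "Database D": 0,
    "Journal E": 5,
}

counts = resources_accessed(student_logins)
```

Here `counts` comes out as `{1: 4, 5: 3, 25: 1}`. Each count is a subset of the one before it, so the three metrics are related by construction.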

You’ll already have seen these three dimensions in the posts on our demographic and subject-based analysis. But now I’m going to see whether there’s a relationship between use on these dimensions and final degree outcome, using the same methodology as Phase 1 of the project. Figure 1 shows the results.

Figure 1: Usage and final degree outcome

As you can see, there are quite a few statistically significant differences for the e-resource dimensions. Most of these are small effects, but the difference between a first and a third is medium sized for the number of e-resources accessed, and the number accessed five or more times. This is a really interesting finding. It suggests that breadth of reading – indicated by using a number of different e-resources – might be a particularly important factor in degree success, and leads to all kinds of questions about how the library might support students in reading widely.

You can see that we’ve found a difference for the percentage of overnight usage as well! Weirdly, this only pops up in relation to the difference between 2.i and 2.ii degrees, and it’s a minuscule effect. I’m inclined to dismiss this as a blip and go with our previous finding that there isn’t a significant difference between grades in terms of their overnight usage: with the same caveat that our model is different from the one used by Manchester Met, and thus perhaps not as able to identify nuanced differences.

I’ve already posted some early findings about usage and dropping out, showing that there is a relationship between not using the library and dropping out. As you might remember, these findings had a fairly big health warning attached to them.

Since that blog was posted, I think I’ve found another way to look at the relationship between non-use and dropping out, one which is a bit more nuanced. You might remember that I created a ‘binary’ dummy variable for use of the library, in order to get around the problem of cumulative usage (we’d expect a third year student to have higher usage than a first year student, simply because they’ve been at the university longer, but we didn’t have access to the year-of-study data that would allow us to correct for this). I mentioned at the time that turning a continuous variable into a categorical one isn’t an ideal solution, and it turns out that a better one was staring me in the face the whole time (this, I find, is a frequent joy of statistical analysis!).

In that very same blog, I said that we were only looking at the most recent year of data: this was to give first and third year students the same opportunity to move themselves from the ‘non-user’ to ‘user’ category. But (and I’m kicking myself now for not seeing this at the time) once you look only at the last year of data, you can use cumulative measures again! After all, if you only look at one year of data for every student, you eliminate the problem whereby third year students will have accumulated more use simply by virtue of the fact that they’ve been there longer. Those first two years cease to be relevant as they’re not counted within the measure! Of course, we still have the problem of different patterns of usage in years 1, 2 and 3, but I think we just have to shrug our shoulders and accept that.

So, I have run some new tests – Mann-Whitney U again – using a cumulative measure of usage for the first two terms of the 2010-11 academic year. As you might remember, we’re only looking at people who dropped out in term three (again, this is to try and give the dropouts a decent amount of time to establish their usage patterns: including people who dropped out in their second week of term is going to skew our results as they may not even have had their library induction by that point!). This means that all the students included in this study were at the university in the first two terms, and they have all had exactly the same opportunity to accumulate usage.
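The cumulative measure itself is just a sum over the terms that every student was present for. Here is a minimal sketch, assuming usage is available as (student, term, hours) records. That layout, the student identifiers and the function name are assumptions for illustration and are not how the Huddersfield data is actually structured.

```python
def cumulative_usage(student_ids, records, terms=(1, 2)):
    """Total each student's usage over the given terms, keeping non-users at zero."""
    totals = {student_id: 0.0 for student_id in student_ids}
    for student_id, term, hours in records:
        if student_id in totals and term in terms:
            totals[student_id] += hours
    return totals


# Made-up records: (student id, term number, hours logged into e-resources).
records = [
    ("s1", 1, 2.0),
    ("s1", 2, 3.5),
    ("s1", 3, 10.0),  # term three is ignored
    ("s2", 1, 0.5),
    ("s2", 3, 4.0),
]

usage = cumulative_usage(["s1", "s2", "s3"], records)
```

Starting every student at zero matters here: a student with no records at all, like `s3`, is a non-user, and non-users are exactly the people this analysis needs to keep in.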

You might remember that in the earlier analysis I mentioned that we couldn’t separate out the full-time and part-time students as the sample became too small. Well, the new test resolves that problem, at least in part. Figure 1 shows the effect sizes for the difference in usage between current and dropout students: in the first column, full-time and part-time students based on all sites (Huddersfield and its secondary campuses at Barnsley and Oldham); in the second column, full-time and part-time students based at Huddersfield only; and in the third column, full-time students based at Huddersfield.

Figure 1: Retention and usage

You can see that as we decrease the sample size by removing certain categories of student, some of the dimensions cease to be statistically significant. When we remove sites other than Huddersfield, library PC use stops being significant, and when we limit our analysis to full-time students, the number of library visits also stops being significant. This tallies with the findings from Phase 1 of the project, which found that items borrowed, e-resource hours and PDF downloads were the only dimensions to have a statistically significant relationship with final degree outcome.

The effect sizes here are all very small, but they are significant and they are there for the number of items borrowed, the hours logged into e-resources and the number of PDF downloads. This reinforces the findings from my earlier analysis: that non-usage of the library can’t be taken, on its own, as a predictor of dropping out, but that there is some kind of relationship which suggests that it could be a useful warning signal, taken in the context of other warning signals such as course tutor feedback, lack of attendance at lectures and so forth.

It might have been interesting to look at some of the other dimensions of use in relation to dropping out. Percentage of overnight use is one obvious one: we didn’t find a correlation between overnight use and poor grades, but perhaps there’s one between overnight use and dropping out? The number of e-resources accessed might also be interesting. We can’t do this with the Huddersfield data at the moment, because both of those dimensions contain information about the full year and it’s not possible to extract terms one and two for analysis. Maybe that’s one for Phase 3…?

We’re moving on in this second of our final blogs to look at the relationship between discipline and usage. There are some interesting – although not necessarily unexpected – findings here, and bigger effect sizes than we saw with the demographic variables. Again, I’ve only shown the differences that are statistically significant: effect sizes are highlighted by the depth of colour, with small effects pale, medium effects a little darker, and large effects very dark.

You might remember that we had to aggregate some of the demographic categories – such as ethnicity and country of origin – both to protect student confidentiality and to get meaningful results. The same is true of our discipline-based analysis. But, because we suspected that this was going to be quite an important area, we’ve decided to go a bit more granular. So, we’ve created two levels of analysis. Each of the 100-odd courses offered by Huddersfield to its full-time, undergraduate students has been classified into one of 17 ‘clusters’. These clusters have then been aggregated to form six ‘groups’. We can compare the groups to see some overarching differences, and then drill down to compare clusters within groups, to get a more detailed understanding.

It’s important to mention that the grouping work was done by librarians and student support staff, so it represents the relationships that they see between the different courses. I suspect that if we’d asked course tutors, lecturers or students themselves we might have seen slightly different combinations. Also, the grouping was slightly driven by numbers: we had to make sure that there were enough people in each category to make the statistical tests viable and to ensure that anonymity was protected.

Limitations duly mentioned, let’s go on to look at the findings. First, the aggregated subject groups. We used the social science group as our control, as it was the biggest. As you can see from Figure 1, it is also the higher user in most comparisons where significant differences exist: the only exceptions are the comparisons with the number of items borrowed and the number of e-resources accessed by health students. We think this might be because health students are often out on placements, limiting their opportunities to visit the library but making e-resources more important. Furthermore, e-resource use is heavily embedded into the nursing curriculum, and most students will have classes which require them to go into the library’s e-resources and search for items.

Figure 1: Aggregated subject groups

The overall takeaway from this figure, I think, is that computing and engineering students are less intensive users on a number of dimensions, with a medium effect for the number of items borrowed. Arts students are very low users compared to the social sciences, with medium effects on several dimensions.

Let’s move on now to look at how the smaller clusters relate to each other within the groups. Poor old science was in a group all by itself: it wasn’t possible to sub-divide this one so we’ll skip over it and look at health. This has been divided into nursing and other health disciplines (including subjects such as sports therapy, physiotherapy, podiatry and occupational therapy) and you can see from Figure 2 that nursing is a bigger user on pretty much every dimension. There’s a large effect for the number of e-resources accessed, suggesting that nursing students are reading more widely than their counterparts in the other health disciplines. This might be because nurses are required to use a certain number of documents for some of their assignments – ten pieces of a specific kind of research for systematic reviews, or four journal articles for an early assignment; the wide use might represent their efforts to find exactly the right kind of resource. There are also medium-sized effects for library visits, hours logged into e-resources, number of e-resources accessed 5 or more times and number of PDF downloads and small effects for hours logged into library PCs and number of e-resources accessed 25 or more times; in each case, nurses are the higher users.

Figure 2: Health group

The computing and engineering group has been divided into – errr – computing and engineering! The differences here are fewer and smaller. Perhaps unsurprisingly, engineers use the library PCs more often – presumably the computing students are happily tapping away on their personal laptops. And computing students are downloading more PDFs. But on the whole, the behaviour of these two clusters is quite similar.

Figure 3: Computing and engineering group

Now we move onto the humanities group, which has been subdivided into three clusters: English; drama; and media and journalism. The first thing to note is that there are no statistically significant differences between English and drama students. But there are differences between these two clusters and the media students, and where those differences exist the media students have the lower level of usage. Most of these differences relate to e-resource use: the English students have higher use on pretty much all the e-resource dimensions with medium-sized effects. The differences between media and drama are smaller.

Figure 4: Humanities group

Now, on to Figure 5 and the social science group, which looks colourful and complicated! The first thing to note is the number of cells which are shaded in dark colours: there are a lot of big effects within this group. Overall, behavioural sciences dominate. They have higher usage, at a statistically significant level, than every other cluster on at least one dimension. Business is the next-most-dominant discipline, although it’s worth noting that business students borrow fewer items than their colleagues in every other discipline except law, and that the effect sizes here are medium or large. Lawyers, in fact, have the lowest use compared to most subjects, but they do use the library, and especially its computers, more than their counterparts in social work and education. Finally, there are no significant differences between social work and education: perhaps it’s unsurprising that these two vocational courses have similar patterns of usage.

Figure 5: Social science group

Finally, we turn to the arts group. This one might have been skewed a bit by the inclusion of music, which contains a couple of courses which might have fitted alongside English and drama in the humanities group as well as more technology-focused subjects which fit with the design courses. Musicians are heavier users than every other cluster on at least four dimensions. They borrow more items than all of the other clusters, in each case with a large effect. They are also higher users of electronic resources, particularly when compared to the 2D and 3D designers. Elsewhere, there are fewer significant differences. Architects do not visit the library very often compared to the other clusters in this group, but fashion designers do. It’s possible that the architects’ relatively low level of use is because they have a ‘Design Centre’ in the department, which offers access to computers, journals and other materials: the ‘Art and Design Resource area’ in the library has traditionally been more focused on textiles and fashion design, which might explain the fashion designers’ higher number of library visits. And, as a final footnote, 3D designers are fond of e-resources.

Figure 6: Arts group

The main finding from this section of the analysis – that discipline has a big effect on patterns of library usage – might not be earth shattering. But it does provide statistical backing for something that many librarians will already know anecdotally or from their own observations. This could be a really useful starting point for conversations with academics – checking whether the low usage by their students is a cause for concern and, if so, what might be done to increase it.

And so, our exciting rollercoaster of findings gets underway with a quick look at some of the demographic factors which seem to affect usage of library resources. Now, I’ve already posted about some early findings, which hadn’t been tested for statistical significance. These new findings have, and I’m only showing the results which are significant: i.e. where we can be confident at the agreed level (which varies from test to test but is in every case a standard statistical method) that the results represent real differences within a wider population, and aren’t just a coincidence within the sample of data that we’ve got.

We’re looking here at final year students who have finished their degrees, were full-time students and whose courses were based at the Huddersfield campus. This helps us to exclude a few variables that might possibly confound our overall findings: for example, we might find that mature students have less library usage, but if lots of our mature students are part-time students (and we haven’t tested for this, so I don’t know – it’s just an example) then we wouldn’t be able to tell whether it’s their maturity or their part-time status that limits their use of the library. Now, one thing we haven’t been able to do is to control for the different variables that we want to test – so we don’t necessarily know whether, say, our Asian students are disproportionately male, and it’s their gender rather than their race that makes them use the library more often (again, just an example – don’t quote me on this). This is a bit of a problem, but it’s not unusual in statistics and with the sample that we’ve got there’s no way round it other than to shrug our shoulders and make sure we acknowledge this when we report the findings. (Hence this paragraph of caveat!)

For each finding, I’m showing the effect size. For the tests that we’ve used (Mann-Whitney U, fact fans), these are generally reckoned as follows: anything up to .3 is a small effect, between .3 and .5 is a medium sized effect, and anything over .5 is a large effect. (Ignore the minus signs by the way – they’re just a function of the test and don’t mean anything.) You’ll notice that most of the demographic variables only show small effect sizes but don’t worry – it gets a lot more exciting when we look at subjects.
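For the curious, one common way to turn a Mann-Whitney U result into an effect size is r = Z / √N, where Z comes from the normal approximation to U and N is the total number of students in both groups. The sketch below assumes that convention; the post does not spell out the exact formula used. It also leaves out the correction for tied values that a statistics package would normally apply. The function names and data are made up, and the size labels simply follow the thresholds given above.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_effect_size(group_a, group_b):
    """Effect size r = Z / sqrt(N), using the normal approximation to U."""
    n_a, n_b = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    u_a = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)  # no tie correction
    z = (u_a - mean_u) / sd_u
    return z / math.sqrt(n_a + n_b)


def describe_effect(r):
    """Label an effect size using the thresholds quoted in this post."""
    size = abs(r)  # the sign only reflects which group was entered first
    if size <= 0.3:
        return "small"
    if size <= 0.5:
        return "medium"
    return "large"


# Made-up usage figures for two tiny groups, for illustration only.
r = mann_whitney_effect_size([1, 2, 3], [4, 5, 6])
label = describe_effect(r)
```

In this toy example the first group has the lower values, so r comes out negative; swapping the two groups flips the sign without changing the size, which is why the minus signs can be ignored.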

I think most of the dimensions of use are fairly self-explanatory, but I should probably clarify what we mean by the three that refer to the ‘number of e-resources accessed’. These metrics show how many of Huddersfield’s 240-odd e-resources, which range from large journal platforms and databases down to individual journal subscriptions, a student has logged into during the year: at least once, at least five times, or at least 25 times.

So, without any further ado, let’s crack on!

Figure 1: Age and usage

Figure 1 shows the relationship between age and usage. We’ve separated our students into mature – those who entered the university aged 21 or older – and non-mature students (it was VERY difficult to come up with a non-pejorative name for that group!). As you can see, mature students have higher library usage than non-mature students on every dimension except library visits and hours spent logged into the library PCs. Remember, these are all full-time students, so it’s not that they’re illicitly logging in from work rather than visiting the physical library. We wondered whether the mature students can afford their own laptops and therefore have less need of the library PCs: it might also have something to do with the way younger students treat the library as more of a social space to hang out with friends. Most departments also have resource centres where students can access computers, and these may seem like a less daunting study environment for some students.

Figure 2: Gender and usage

Figure 2 looks at gender and usage. Again, we see differences here on almost every metric – hours logged into the library PCs and number of e-resources accessed 25 or more times are the only exceptions. And on almost every one, women are bigger users than men – the only exception is the number of visits to the library, where men dominate.

Now, we’re looking at something a bit different for the next few figures. Rather than comparing each group to every other, we’ve chosen one group as a control and compared all the rest to them (this is for reasons to do with our statistical methods). In each case, our control is simply the biggest group: this allows us to compare minorities with the majority and hopefully identify behaviour that might otherwise get lost.
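As a rough illustration of that set-up, the sketch below picks the largest group as the control and lists the comparisons that follow from it. The group labels and the function name are hypothetical, and this is only one way the pairing could be organised, not a description of the actual analysis.

```python
from collections import Counter


def control_and_comparisons(group_labels):
    """Pick the largest group as control and pair every other group with it."""
    counts = Counter(group_labels)
    control = max(counts, key=counts.get)
    pairs = [(control, group) for group in sorted(counts) if group != control]
    return control, pairs


# Hypothetical labels: one entry per student.
labels = ["A"] * 50 + ["B"] * 8 + ["C"] * 5 + ["D"] * 3
control, pairs = control_and_comparisons(labels)
print(control)  # the biggest group
print(pairs)    # each remaining group compared against the control
```

With four groups this gives three comparisons rather than the six you would get by comparing every group with every other.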

Figure 3: Ethnicity and usage

Figure 3 looks at ethnicity and usage; the control group here is ‘white’. (For more information on how we constructed our ethnicity categories, go back to my earlier post.) You’ll notice that there are fewer significant differences here, and almost none to do with e-resources. The only exception is the number of e-resources accessed by Chinese students – we’ll come back to that. Asian students are big users of the libraries, and especially the library PCs, but that heavy use isn’t translating into more borrowing. Facebook, anyone?! Black students are also big library users, and they are borrowing more than their white counterparts. Chinese students, as we’ve said, are borrowing less and accessing fewer e-resources than white students: there may be an issue here to do with breadth of reading which (as we’ll see in a few posts’ time) is important.

Figure 4: Country of domicile and usage

Figure 4 looks at country of domicile (Huddersfield-ese for ‘where do you live when you’re not at university?’), with the control group being students based in the UK. There are more significant differences here on several dimensions of use. Notably, students from the new EU (member states which joined in 2004 or later) are very keen on computers: they spend more time logged into e-resources, download more content and use more resources more often than UK-based students. Students from the old EU (pre-2004 member states) and the very broad ‘rest of the world’ category visit the library less than UK students, but borrow more items and use the PCs more often when they are there. Students hailing originally from China show lower usage on a number of dimensions: alongside Figure 3, this suggests that Chinese students are systematically lower users of library resources.

All this is very interesting, but how useful is it in helping librarians to develop services that meet the needs of their users? In truth, it’s really only a first step. We now know where the differences in usage are, but we don’t know why they exist, and that’s what we need to understand if we are to tailor services to students. Maybe the Chinese students are getting all their information from alternative sources and so it doesn’t matter that their usage is much lower. Perhaps the high use of e-resources by students from the new EU doesn’t indicate thorough, broad reading but rather very inefficient search and discovery strategies. In order to really understand what’s going on behind the numbers, we will need to take a more qualitative approach, running focus groups and case studies to explore students’ behaviours and the reasons for those behaviours. Only then can we attempt to understand what we should do to ensure the library’s doing everything it can to support all its users.

I have spent the last few days looking at the relationship between using the library and dropping out of an undergraduate degree. I have some more results for you, but these ones are coming with a very big health warning – so, treat with caution!

You might remember that back in the very first blog I talked a bit about where your data comes from, and making sure that it is appropriate for the task at hand. Well, we’ve uncovered a slight problem with the Huddersfield data. Quite early on, it became clear that the information we had for ‘year of study’ was not accurate for every student. This is a challenge for two reasons. First, we don’t know whether usage patterns are the same for all three years of study, so we don’t want to compare different year groups without controlling for any variations that might be related to the year that they’re in. And second, we can’t use cumulative data to measure usage: the total usage of a third year student over their three years of study is likely to be higher than the total usage of a first year over their single year of study, but that doesn’t mean the student is actually more likely to use the library: it just means they’ve had more of an opportunity to do so.

Our initial workaround for this was to focus exclusively on final year students, who we could identify by checking to see whether they have a final grade or not. That was fine for the demographic stuff; it gave us a decent sized sample. But now we’ve moved on to look at dropouts, things have become a bit more complicated. The lack of ‘year of study’ data has become a problem.

We can’t use our ‘third year only’ workaround here, because the way that we were identifying the relevant students was by looking for their final grade: and, of course, if you’ve dropped out, you probably won’t have one of those! Also, we needed the first and second year students involved, to make sure that we could get a big enough sample for the statistical tests.

So what did we do? Well, first of all we had to find a way around the ‘cumulative measure’ problem that I mentioned above. Our solution was to create a binary variable: either a student did use the library, or they didn’t. In general, it’s not desirable to reduce a continuous variable to a categorical one – it limits what you can understand from the data – but in this instance we couldn’t think of any other options.

Even this still doesn’t completely solve the problem. Students who have been at Huddersfield for three years have had three times as long to use the library as those who’ve only been there for a year: in other words, they’ve had three times as long to move themselves from the ‘no usage’ category to the ‘usage’ one – because it only takes one instance of use to make that move. To try and get around this, we only looked at data for 2010-11 – the last year in our dataset. For the same reason, we’ve only included people who dropped out in their third term: this means that they will have had at least two full terms to achieve some usage. Now, this is only a partial solution: we still don’t know whether usage patterns are different in first, second, and third years, so we are still not comparing apples and apples. But we’re no longer comparing apples and filing cabinets.
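The filtering and the binary variable described above can be sketched roughly as follows. The record layout (`year`, `logins`, `dropout_term`) is hypothetical, as is my reading that students who did not drop out are all kept while dropouts are kept only if they left in their third term.

```python
ANALYSIS_YEAR = "2010-11"  # the last year in the dataset


def in_sample(record):
    """Keep 2010-11 records; keep dropouts only if they left in term three."""
    if record["year"] != ANALYSIS_YEAR:
        return False
    term = record["dropout_term"]  # None means the student did not drop out
    return term is None or term == 3


def to_binary(record):
    """Collapse a usage count into used / did not use."""
    return {
        "student": record["student"],
        "used_eresources": record["logins"] > 0,
        "dropped_out": record["dropout_term"] is not None,
    }


# Hypothetical records, purely for illustration.
records = [
    {"student": "s1", "year": "2010-11", "logins": 500, "dropout_term": None},
    {"student": "s2", "year": "2010-11", "logins": 1, "dropout_term": 3},
    {"student": "s3", "year": "2010-11", "logins": 0, "dropout_term": 1},
    {"student": "s4", "year": "2009-10", "logins": 12, "dropout_term": None},
]
sample = [to_binary(r) for r in records if in_sample(r)]
print(sample)
```

Notice that one login and 500 logins end up in the same category, which is exactly the information we lose by reducing a continuous variable to a categorical one.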

Furthermore, before I share the findings (and I will, I promise!) I must add that we could only test this association for a combination of full- and part-time students. Once you separate them out, the test becomes unreliable: the sample of dropouts is too small. So we don’t know whether one group is exerting an unreasonable influence within the sample.

In short, then, there’s a BIG health warning attached to these findings. We can’t control for the year of study, and so if there’s a difference in usage patterns for first, second and third years, our findings will be compromised. We can’t control for full- vs part-time students, so if there’s a difference in their usage patterns our findings will be compromised. Both of these are quite likely. But I think it’s still worth reporting the findings because they are interesting, and worthy of further investigation – perhaps by someone with more robust data than we have.

So – here we go!

First up – e-resource usage. The findings here are pretty clear and highly significant. If you do not use the library, you are over seven times more likely to drop out of your degree: 7.19, to be precise. If you do not download PDFs, you are 7.89 times more likely to drop out of your degree.

Library PC usage also has a relationship with dropping out, although the association is weaker: students who do not use the PCs are 2.82 times more likely to drop out of their degree.
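To make ‘times more likely’ concrete, here is a minimal sketch of how such a figure can be calculated from a two-by-two table of users and non-users against dropouts and completers. I haven’t said above whether the figures are risk ratios or odds ratios, so the sketch shows both; the counts are entirely hypothetical and are not the Huddersfield data.

```python
def risk_ratio(non_users_dropped, non_users_total, users_dropped, users_total):
    """Dropout risk among non-users divided by dropout risk among users."""
    risk_non_users = non_users_dropped / non_users_total
    risk_users = users_dropped / users_total
    return risk_non_users / risk_users


def odds_ratio(non_users_dropped, non_users_total, users_dropped, users_total):
    """Dropout odds among non-users divided by dropout odds among users."""
    odds_non_users = non_users_dropped / (non_users_total - non_users_dropped)
    odds_users = users_dropped / (users_total - users_dropped)
    return odds_non_users / odds_users


# Hypothetical counts: 10 of 100 non-users drop out, 20 of 1000 users drop out.
print(round(risk_ratio(10, 100, 20, 1000), 2))
print(round(odds_ratio(10, 100, 20, 1000), 2))
```

Even in this made-up example, 90 of the 100 non-users complete their course, so a high ratio can sit alongside a low absolute risk.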

We tried testing to see whether there was a relationship between library visits and dropping out, and item loans and dropping out, but in these cases the sample didn’t meet the assumptions of the statistical test: we can’t say anything without a bigger sample.

So why are these results interesting? Well, it’s not because we can now say that using the library prevents you from dropping out of your course – remember, correlation is not causation and nothing in this data can be used to suggest that there is a causal relationship. What it does offer, though, is a kind of ‘early warning system’. If your students aren’t using the library, it might be worth checking in with them to make sure everything is alright. Not using the library doesn’t automatically mean you’re at risk of dropping out – in fact, the number of students who don’t use the library and don’t drop out is much, much higher than the number who do leave their course prematurely. But library usage data could be another tool within the armoury of student support services, one part of a complex picture which helps them to understand which students might be at risk of failing to complete their degree.

It’s been a long, long time since the last blog. Sorry! We have been busy – more of which later. But for now, I’d like to share some initial results from the work we’ve been doing looking at time-of-day of usage and outcome.

Previous work by Manchester Metropolitan University suggested that overnight usage is linked to poor grades for students. This makes intuitive sense: surely it can’t be a good sign if someone based in the UK is using library resources at 3am on a regular basis – that looks like someone who is struggling with their workload. We wanted to test whether this finding would hold true with the Huddersfield data.

At first glance it seemed like it might. Look at this graph, showing the e-resource usage patterns for students with all four grade outcomes over a 24 hour period*. The firsts have much higher usage than anyone else during the daytime; at night, although the timings become similar, the students who achieved thirds seem to have the highest use (around 4-5am).

But that’s not really a fair comparison. We know from the Phase 1 work that there’s a positive relationship between usage and outcome – in other words, the higher your grade, the higher your usage. So maybe what the first graph is showing is just that people with firsts have higher usage – full stop. The difference isn’t related to time of day at all.

To test this, we created a new variable, looking at the proportion of usage that happened within a certain hour, measured as a percentage of overall usage. This eliminates any bias which might emerge by considering overall usage. For example, a student who logged in to e-resources 20 times, 2 of them between 10-11am, would have the same figure (10%) as a student who logged in to e-resources 400 times, 40 of them between 10-11am. The much higher overall number of the second student becomes irrelevant, and we’re able to look at a true reflection of usage patterns. The following graph shows what happens.
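The calculation behind that example can be sketched in a few lines. The function name and the dictionary layout (hour of day mapped to number of logins) are assumptions made for illustration; the two students are the ones from the paragraph above, with the rest of their logins placed arbitrarily in a single other hour.

```python
def hourly_share(logins_by_hour):
    """Convert login counts per hour into percentages of the student's total."""
    total = sum(logins_by_hour.values())
    if total == 0:
        return {}  # no usage at all, so there is no pattern to describe
    return {hour: 100 * count / total for hour, count in logins_by_hour.items()}


# Keys are hours of the day (10 means 10-11am); values are logins in that hour.
light_user = {10: 2, 14: 18}    # 20 logins in total
heavy_user = {10: 40, 14: 360}  # 400 logins in total

print(hourly_share(light_user)[10])
print(hourly_share(heavy_user)[10])
```

Both students come out at 10% for the 10-11am slot, even though one of them has twenty times as many logins overall.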

What a difference! The lines are almost identical. Again, you can see that there’s a point around 9pm where users who go on to achieve a third overtake users who go on to achieve a first, and they maintain their dominance around 3-7am. But the overall message is that patterns of e-resource use don’t differ very much by outcome; it’s just the volume that differs.

We tested these findings and, sure enough, although there is a statistically significant difference in overall usage between 9pm and 9am for students with different grades, there is no statistically significant difference in the percentage of overall use which takes place within that timeframe.

So we’ve found something different from Manchester Met. I should stress that our methodology is different, and more simplistic, so it may just be that we haven’t managed to identify a relationship that does exist – perhaps, once you factor out lots of the other variables which affect outcome, a relationship can be detected. But you’d need a lot more data to do that, and a much more complicated model.

As a final footnote – what’s happening around 8pm in the proportion of usage graph?! We’ve tentatively termed this the ‘Eastenders gap’: high-achieving students seem to have a much greater resistance to TV-soap-based distractions than those who go on to get a Third! One to explore in the focus groups, perhaps…

* When we say ‘usage patterns over a 24 hour period’, we’re actually aggregating the whole year’s data. So 3 logins between 10-11am means a user has logged in 3 times during that hour over the course of a year. We don’t count multiple logins in a single hour on the same day (e.g. someone logs in, times out, and logs in again – that’s just one login as far as we’re concerned).
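The counting rule in this footnote can be sketched as below, assuming each login is available as a timestamp (that format, and the function name, are my assumptions rather than a description of the actual logs).

```python
from collections import Counter
from datetime import datetime


def logins_by_hour(timestamps):
    """Count logins per hour of day over the year, at most one per hour per day."""
    distinct_slots = {(ts.date(), ts.hour) for ts in timestamps}
    return Counter(hour for _, hour in distinct_slots)


# Hypothetical login times for one student.
logins = [
    datetime(2011, 1, 10, 10, 5),
    datetime(2011, 1, 10, 10, 40),  # same day, same hour: not counted again
    datetime(2011, 2, 3, 10, 15),
    datetime(2011, 3, 1, 10, 59),
    datetime(2011, 1, 10, 23, 30),
]
counts = logins_by_hour(logins)
print(counts[10], counts[23])
```

Here the student ends up with 3 logins between 10-11am across the year, because the two logins on 10 January fall in the same hour of the same day.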