Category Archives: Data Analysis

LIDP Toolkit: Phase 2

We are starting to wrap up the loose ends of LIDP 2. You will have seen some bonus blogs from us today, and we have more about reading lists and focus groups to come – plus more surprises!

Here is something we said we would do from the outset – a second version of the toolkit to reflect the work we have done in Phase 2 and to build on the Phase 1 Toolkit:

Stone, Graham and Collins, Ellen (2012) Library Impact Data Project Toolkit: Phase 2. Manual. University of Huddersfield, Huddersfield.

The second phase of the Library Impact Data Project set out to explore a number of relationships between undergraduate library usage, attainment and demographic factors. There were six main work packages:

  1. Demographic factors and library usage: testing to see whether there is a relationship between demographic variables (gender, ethnicity, disability, discipline etc.) and all measures of library usage;
  2. Retention vs non-retention: testing to see whether there is a relationship between patterns of library usage and retention;
  3. Value added: using UCAS entry data and library usage data to establish whether use of library services has improved outcomes for students;
  4. VLE usage and outcome: testing to see whether there is a relationship between VLE usage and outcome (subject to data availability);
  5. MyReading and Lemon Tree: planning tests to see whether participation in these social media library services had a relationship with library usage;
  6. Predicting final grade: using demographic and library usage data to try and build a model for predicting a student’s final grade.

This toolkit explains how we reached our conclusions in work packages 1, 2 and 6 (the conclusions themselves are outlined on the project blog). Our aim is to help other universities replicate our findings. Data were not available for work package 4, but should they become available they can be tested in the same way as in the first phase of the project, or using the correlations outlined below. Work package 6 also presented a data challenge: we made some progress, but not enough to present full results.

The toolkit aims to give general guidelines about:

1. Data Requirements
2. Legal Issues
3. Analysis of the Data
4. Focus Groups
5. Suggestions for Further Analysis
6. Release of the Data

BONUS findings post! Correlations


Hello! You’ve changed the format of the blogs, haven’t you? Yes, we like to mix it up a bit. We thought this might be fun, and not at all derivative.

What are these bonus findings then? Well, it all stems from the fact that this time round we have been given students’ final grades as a percentage, rather than a class. Continuous rather than categorical data. This opens up a whole new world of possibilities in terms of identifying a relationship between usage and grades.

Wait, didn’t you already prove that in Phase 1? We certainly did.

So why are you doing it again? Well, to not-quite-quote a famous mountaineer – because we can. It’s important to be clear that we’re not trying to ‘prove’ or ‘disprove’ results from the previous phase. Those stand alone. We’re simply taking advantage of the possibilities offered by the new data.

And those possibilities are…? Remember Spearman’s correlation coefficient from the last post? Well, we can use that again. As you’ll remember from earlier posts, it’s best to keep continuous data continuous if you can. The first round of the project gave librarians who had percentage grades – continuous data – a methodology that required them to convert those grades into classes – categorical data. So we’re outlining this technique for their benefit – it saves time AND it’s better!
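For anyone who wants to see the mechanics, here is a minimal sketch of Spearman’s coefficient in Python. The grades and hours below are invented for illustration, and in practice you would let a statistics package (SPSS, or `scipy.stats.spearmanr`) do this for you:

```python
def rank(values):
    """Assign 1-based ranks, averaging the ranks of tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson's correlation computed on the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Invented data: final grades (%) against hours of e-resource use.
grades = [52.0, 58.5, 61.0, 64.0, 70.5, 75.0]
hours = [3, 10, 8, 25, 40, 38]
print(round(spearman(grades, hours), 3))  # 0.886
```

Because it works on ranks rather than raw values, the coefficient only assumes a monotonic relationship – which is exactly why it suits non-normally-distributed usage data.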

But what if you’ve only got the class-based data? Not a problem! Use the old technique, which is designed for class-based data. This is just about giving people options so that they can choose whatever fits their data best.

Right. Got it. So, what did you find? This might be where you have to take a bit of a back seat, my inquisitive friend.

In fact, we found absolutely nothing to surprise us. The findings echo everything we established in the first phase, and the additional work we’ve done with extra variables in this phase. Figure 1 shows the effect sizes and significance levels for each variable.

As usual, I’ve only reported the statistically significant ones, and they are exactly the same as the ones that were statistically significant in our previous tests. You can see that, again, we’ve found a slight negative correlation between the percentage of e-resource use which happens overnight and the final grade. Once again, I’m inclined to dismiss this as a funny fluke within the data, rather than an indication that overnight usage will harm your grade.

So nothing new to report? Not really. Just a new method (outlined in the toolkit) for those librarians who want to take advantage of their continuous datasets.

FINDINGS post 6: BONUS new findings on UCAS points and place of usage

Good news! We have acquired some extra data from various very helpful departments at Huddersfield, which allows us to explore a few additional angles for both indicators of usage and the relationship between usage and outcomes. The new data relates to the UCAS points of the students within the study, and to the percentage of each student’s total e-resource usage that occurs on campus.

Let’s start with the UCAS points. We’ve treated these as a kind of ‘demographic’ characteristic and, as with the ones we’ve already posted about, we wanted to see whether there was a correlation between them and the library usage variables. Because both the UCAS points and the usage data are continuous variables, but not normally distributed, we used a slightly different measure than in our previous work – Spearman’s correlation coefficient. Figure 1 shows the findings.

I’ve only included effect sizes for the ones that were statistically significant and, as you can see, there aren’t many of them. And even for those that were, the effect sizes were pretty small. This is an interesting finding, suggesting that it’s not necessarily the case that high-achieving students are simply more likely to use the library and that this lies behind the relationship we’ve seen between usage and outcomes in phase 1 of the project – we’d have to run further tests to check whether this is in fact the case, of course.

Next, we moved on to look at the percentage of e-resource usage which occurred on Huddersfield’s campuses (as opposed to offsite – for example, at home or in a coffee shop). Again, as with the earlier measures of usage, we wanted to see whether this varied with some of the demographic variables. Figure 2 shows our findings: I’ve only included the ones with statistically significant variations.

Younger students and men spend proportionally more time than mature students and women accessing e-resources on campus (as compared to other locations). The same is true for Asian and black students, compared to white students; for computing and engineering students compared to social scientists; and for students based in the UK compared to those based in China. Remember, these are just the ones with statistically significant differences, so in fact there are not that many – and they are all small effect sizes.

Finally – and perhaps most interestingly – the relationship between percentage of usage of e-resources on campus and final degree results. Figure 3 shows the findings from this analysis; again, only the statistically significant results are shown.

You can see that, although the effect sizes are small, there are differences in the percentage of on-campus usage between students who go on to achieve a First and those who achieve a 2.ii or Third, and between those who achieve a 2.i and those who achieve a 2.ii. In each case, the students who go on to achieve higher grades are the ones with a lower percentage of on-campus e-resource usage. This tells us that there is some relationship between reading electronic content in locations other than the university and doing well in your degree.


In work package 6 (Data Analysis) we said we would investigate some in-house projects:

Lemontree is designed to be a fun, innovative, low-input way of engaging students through new technologies, increasing their use of library resources and, by extension, their final degree awards. The project aims to increase usage of library resources using a custom social, game-based eLearning platform designed by Running in the Halls. It builds on previous ideas, such as those developed at Manchester Metropolitan University to support inductions and information literacy, and uses reward systems similar to those found in location-based social networks such as Foursquare.

Stone, Graham and Pattern, David (2012) Knowing me…Knowing You: the role of technology in enabling collaboration. In: Collaboration in libraries and learning environments. Facet, London. ISBN 978-1-85604-858-3

When registering for Lemontree, students sign terms and conditions that allow their student number to be passed to Computing and Library Services (CLS). This allows CLS to track usage of library resources by Lemontree gamers versus students who do not take part. As part of LIDP 2, we wanted to see if we could analyse the preliminary results of Lemontree to investigate whether engagement with the game makes a difference to student attainment, by comparing the usage and attainment of those using Lemontree with those who are not, across equivalent groups in future years (we only planned to produce a proof of concept from Phase 2).

Andrew Walsh, who is project managing Lemontree at Huddersfield, reports:

Lemontree has slowly grown its user base over the first year of operation, finishing the academic year with 628 users (22nd May 2012), with large numbers registering within the first few weeks of the academic year 2012-2013 (over 850 users registered by 5th October 2012). This gives us a solid base from which we can identify active Lemontree users who will be attending university for the full academic year 2012-2013.

Lemontree currently offers points and awards for entering the library, borrowing and returning books and using online resources as well as additional social learning rewards, such as leaving reviews on items borrowed. The rewards are deliberately in line with the types of data we analysed in the first phase of LIDP.

We have seen healthy engagement with Lemontree, with an average of 74 “events” per user in the first year (an event being an action that triggers the awarding of points).

At the end of this academic year, we will identify those users registered for the full year and extract usage statistics for those students. Those who registered in their second or third years of studies will have their usage statistics compared to their first year of study, to see if engagement with Lemontree impacted on their expected levels of library usage. For those students registered at the start of their first year, we will investigate whether active engagement with the game layer has an impact compared to similar groups of students, such as course, UCAS points, etc., to see if early intervention using gamification can have an impact throughout a student’s academic course.

Library usage and dropping out

I have spent the last few days looking at the relationship between using the library and dropping out of an undergraduate degree. I have some more results for you, but these ones are coming with a very big health warning – so, treat with caution!

You might remember that back in the very first blog I talked a bit about where your data comes from, and making sure that it is appropriate for the task at hand. Well, we’ve uncovered a slight problem with the Huddersfield data. Quite early on, it became clear that the information we had for ‘year of study’ was not accurate for every student. This is a challenge for two reasons. First, we don’t know whether usage patterns are the same for all three years of study, so we don’t want to compare different year groups without controlling for any variations that might be related to the year that they’re in. And second, we can’t use cumulative data to measure usage: the total usage of a third year student over their three years of study is likely to be higher than the total usage of a first year over their single year of study, but that doesn’t mean the student is actually more likely to use the library: it just means they’ve had more of an opportunity to do so.

Our initial workaround for this was to focus exclusively on final year students, who we could identify by checking to see whether they have a final grade or not. That was fine for the demographic stuff; it gave us a decent sized sample. But now that we’ve moved on to look at dropouts, things have become a bit more complicated. The lack of ‘year of study’ data has become a problem.

We can’t use our ‘third year only’ workaround here, because the way that we were identifying the relevant students was by looking for their final grade: and, of course, if you’ve dropped out, you probably won’t have one of those! Also, we needed the first and second year students involved, to make sure that we could get a big enough sample for the statistical tests.

So what did we do? Well, first of all we had to find a way around the ‘cumulative measure’ problem that I mentioned above. Our solution was to create a binary variable: either a student did use the library, or they didn’t. In general, it’s not desirable to reduce a continuous variable to a categorical one – it limits what you can understand from the data – but in this instance we couldn’t think of any other options.

Even this still doesn’t completely solve the problem. Students who have been at Huddersfield for three years have had three times as long to use the library as those who’ve only been there for a year: in other words, they’ve had three times as long to move themselves from the ‘no usage’ category to the ‘usage’ one – because it only takes one instance of use to make that move. To try and get around this, we only looked at data for 2010-11 – the last year in our dataset. For the same reason, we’ve only included people who dropped out in their third term: this means that they will have had at least two full terms to achieve some usage. Now, this is only a partial solution: we still don’t know whether usage patterns are different in first, second, and third years, so we are still not comparing apples and apples. But we’re no longer comparing apples and filing cabinets.
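To make the workaround concrete, here is a sketch in Python. The record layout and figures are invented for illustration, not our actual data:

```python
# Hypothetical records: (student_id, academic_year, term_left or None, logins).
# term_left is None for students who did not drop out.
records = [
    ("s1", "2010-11", None, 42),  # completed the year, used e-resources
    ("s2", "2010-11", 3, 0),      # dropped out in term 3, no usage
    ("s3", "2010-11", 1, 5),      # dropped out in term 1 - excluded
    ("s4", "2009-10", None, 17),  # wrong year - excluded
]

def in_sample(year, term_left):
    """Keep 2010-11 students who either finished the year or left in term 3."""
    return year == "2010-11" and term_left in (None, 3)

# Reduce the cumulative usage count to a binary used / not-used flag,
# alongside a dropped-out flag for the statistical test.
sample = [
    (sid, 1 if logins > 0 else 0, term_left is not None)
    for sid, year, term_left, logins in records
    if in_sample(year, term_left)
]
print(sample)  # [('s1', 1, False), ('s2', 0, True)]
```

The binary flag throws information away, as noted above, but it makes first-, second- and third-year students comparable when their cumulative totals are not.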

Furthermore, before I share the findings (and I will, I promise!) I must add that we could only test this association for a combination of full- and part-time students. Once you separate them out, the test becomes unreliable: the sample of dropouts is too small. So we don’t know whether one group is exerting an unreasonable influence within the sample.

In short, then, there’s a BIG health warning attached to these findings. We can’t control for the year of study, and so if there’s a difference in usage patterns for first, second and third years, our findings will be compromised. We can’t control for full- vs part-time students, so if there’s a difference in their usage patterns our findings will be compromised. Both of these are quite likely. But I think it’s still worth reporting the findings because they are interesting, and worthy of further investigation – perhaps by someone with more robust data than we have.

So – here we go!

First up – e-resource usage. The findings here are pretty clear and highly significant. If you do not use e-resources, you are over seven times more likely to drop out of your degree: 7.19 times, to be precise. If you do not download PDFs, you are 7.89 times more likely to drop out of your degree.

Library PC usage also has a relationship with dropping out, although in this case not using the PCs makes you 2.82 times more likely to drop out of your degree.
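Figures of this kind are relative risks from a two-by-two table of usage against dropout: the dropout rate among non-users divided by the rate among users. A minimal sketch, with invented counts rather than our real ones:

```python
def relative_risk(dropouts_nonusers, total_nonusers, dropouts_users, total_users):
    """Ratio of the dropout rate among non-users to the rate among users."""
    risk_if_nonuser = dropouts_nonusers / total_nonusers
    risk_if_user = dropouts_users / total_users
    return risk_if_nonuser / risk_if_user

# Invented example: 25 of 100 non-users dropped out, vs 10 of 160 users.
print(relative_risk(25, 100, 10, 160))  # 4.0 - non-users four times more likely
```

A ratio like this says nothing about causation, of course – it just summarises how unevenly the dropouts are distributed across the two groups.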

We tried testing to see whether there was a relationship between library visits and dropping out, and item loans and dropping out, but in these cases the sample didn’t meet the assumptions of the statistical test: we can’t say anything without a bigger sample.

So why are these results interesting? Well, it’s not because we can now say that using the library prevents you from dropping out of your course – remember, correlation is not causation and nothing in this data can be used to suggest that there is a causal relationship. What it does offer, though, is a kind of ‘early warning system’. If your students aren’t using the library, it might be worth checking in with them to make sure everything is alright. Not using the library doesn’t automatically mean you’re at risk of dropping out – in fact, the number of students who don’t use the library and don’t drop out is much, much higher than the number who do leave their course prematurely. But library usage data could be another tool within the armoury of student support services, one part of a complex picture which helps them to understand which students might be at risk of failing to complete their degree.

Time of day of usage and outcomes

It’s been a long, long time since the last blog. Sorry! We have been busy – more of which later. But for now, I’d like to share some initial results from the work we’ve been doing looking at time-of-day of usage and outcome.

Previous work by Manchester Metropolitan University suggested that overnight usage is linked to poor grades for students. This makes intuitive sense: surely it can’t be a good sign if someone based in the UK is using library resources at 3am on a regular basis – that looks like someone who is struggling with their workload. We wanted to test whether this finding would hold true with the Huddersfield data.

At first glance it seemed like it might. Look at this graph, showing the e-resource usage patterns for students with all four grade outcomes over a 24 hour period*. The firsts have much higher usage than anyone else during the daytime; at night, although the timings become similar, the students who achieved thirds seem to have the highest use (around 4-5am).

But that’s not really a fair comparison. We know from the Phase 1 work that there’s a positive relationship between usage and outcome – in other words, the higher your grade, the higher your usage. So maybe what the first graph is showing is just that people with firsts have higher usage – full stop. The difference isn’t related to time of day at all.

To test this, we created a new variable, looking at the proportion of usage that happened within a certain hour, measured as a percentage of overall usage. This eliminates any bias which might emerge by considering overall usage. For example, a student who logged in to e-resources 20 times, 2 of them between 10-11am would have the same figure (10%) as a student who logged into e-resources 400 times, 40 of them between 10-11am. The much higher overall number of the second student becomes irrelevant, and we’re able to look at a true reflection of usage patterns. The following graph shows what happens.
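In code, the transformation looks something like this (the login hours are invented):

```python
from collections import Counter

def hourly_percentages(login_hours):
    """Share of a student's total e-resource logins falling in each hour (0-23)."""
    counts = Counter(login_hours)
    total = len(login_hours)
    return {hour: 100 * counts[hour] / total for hour in range(24)}

# Two students with very different volumes but the same 10am share:
light = [10, 10] + [12] * 18    # 20 logins, 2 of them in the 10-11am hour
heavy = [10] * 40 + [12] * 360  # 400 logins, 40 of them in the 10-11am hour
print(hourly_percentages(light)[10])  # 10.0
print(hourly_percentages(heavy)[10])  # 10.0
```

Both students come out at 10% for the 10–11am hour, so volume drops out of the comparison and only the shape of the daily pattern remains.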

What a difference! The lines are almost identical. Again, you can see that there’s a point around 9pm where users who go on to achieve a Third overtake users who go on to achieve a First, and they maintain their dominance around 3-7am. But the overall message is that patterns of e-resource use don’t differ very much by outcome; it’s just the volume that does.

We tested these findings and, sure enough, although there is a statistically significant difference in overall usage between 9pm and 9am for students with different grades, there is no statistically significant difference in the percentage of overall use which takes place within that timeframe.

So we’ve found something different from Manchester Met. I should stress that our methodology is different, and more simplistic, so it may just be that we haven’t managed to identify a relationship that does exist – perhaps, once you factor out lots of the other variables which affect outcome, a relationship can be detected. But you’d need a lot more data to do that, and a much more complicated model.

As a final footnote – what’s happening around 8pm in the proportion of usage graph?! We’ve tentatively termed this the ‘Eastenders gap’: high-achieving students seem to have a much greater resistance to TV-soap-based distractions than those who go on to get a Third! One to explore in the focus groups, perhaps…

* When we say ‘usage patterns over a 24 hour period’, we’re actually aggregating the whole year’s data. So 3 logins between 10-11am means a user has logged in 3 times during that hour over the course of a year. We don’t count multiple logins in a single hour on the same day (e.g. someone logs in, times out, and logs in again – that’s just one login as far as we’re concerned).
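The de-duplication rule in that footnote amounts to collapsing each student’s raw login events to at most one per calendar hour before counting. A sketch, with hypothetical (date, hour) pairs standing in for real timestamps:

```python
from collections import Counter

def logins_per_hour(events):
    """Count each (date, hour) pair once, then total the year's logins by hour."""
    distinct = set(events)  # drops repeat logins within the same hour of a day
    return Counter(hour for _date, hour in distinct)

events = [
    ("2011-03-01", 10), ("2011-03-01", 10),  # timed out, logged back in: 1 login
    ("2011-03-02", 10),
    ("2011-03-02", 14),
]
print(logins_per_hour(events)[10])  # 2 - one per day, aggregated over the year
```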

Our first results!

There’s a little bit more methodology here – but also some results! I promise!

I have been trying to organise the data so that we create some useful variables ready for analysis. As I said in my last post, you need to know what your data actually means before you start analysing it. But sometimes you will also need to manipulate it before you start your analysis, because the variables you’ve been given aren’t quite right for the analysis you want to do.

One of the aims of the project is to understand the characteristics of people who use the library in a bit more detail. Now we know there’s a relationship between usage and attainment, we need to know whether there are any groups who use the library less than others, so that interventions can be targeted towards them. We’re going to look at a number of factors, including gender (Dave’s already posted some early graphs on this), ethnicity, country of origin, age (again, Dave’s got some graphs) and discipline.

Now, to return to my point about manipulating the data – we have some very detailed information about these characteristics for students at Huddersfield. In fact, in some categories there are only half a dozen students. If we were to use this data, it might be possible for a reader to identify individual students, which is a BIG data protection no-no.

So instead, we need to aggregate up. And that raises all sorts of questions. What are the best groupings we can arrive at in order to (a) protect the students and (b) remain theoretically consistent? It might be, for example, that by aggregating the Chinese and the Black – Other categories we could reach a number which would guarantee students can’t be identified – but do we really think those groups have anything in common with one another? Probably not. So it doesn’t make a lot of sense to bung them in together when trying to understand their library usage.
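A sketch of that aggregation step, with made-up category counts and a hypothetical minimum group size:

```python
from collections import Counter

# Made-up counts per detailed category.
raw_counts = {"Chinese": 40, "Black - Other": 6, "Black - Caribbean": 15,
              "Asian - Indian": 30, "Asian - Pakistani": 55}

# The groupings are a judgement call about which categories belong together -
# not just whatever combination happens to make the numbers big enough.
mapping = {"Black - Other": "Black", "Black - Caribbean": "Black",
           "Asian - Indian": "Asian", "Asian - Pakistani": "Asian"}

def aggregate(counts, mapping, minimum=10):
    """Merge detailed categories into broader groups; flag any still too small."""
    merged = Counter()
    for category, n in counts.items():
        merged[mapping.get(category, category)] += n
    too_small = [group for group, n in merged.items() if n < minimum]
    return dict(merged), too_small

merged, too_small = aggregate(raw_counts, mapping)
print(merged)     # {'Chinese': 40, 'Black': 21, 'Asian': 85}
print(too_small)  # [] - every published group meets the minimum
```

Anything flagged as too small would need further merging (or suppressing) before release.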

Here’s some early results, looking at the total number of hours in which a student was logged into e-resources (via EZProxy), and the total number of PDFs that students downloaded, using the aggregates that I’ve chosen. First, the ethnicity ones:

These data are aggregated – each category, apart from Chinese, contains at least 2 different groups from the original data. If you look at the graphs showing the original datasets, which I can’t share here because some of the groups are too small, you can see that there are considerable differences between groups that we’ve aggregated. So that’s something we’ll need to bear in mind – that differences between the groups within our aggregates might be so big as to require us to rethink our groupings. If they don’t all look similar, perhaps it’s not helpful to group them together?

And here’s the same kind of graph, looking at country of domicile.

Again, that’s been aggregated up, in this case by Huddersfield in the original data that was sent to me. I think I’m going to have a play to try and find a happy medium between their highly aggregated data and the very detailed country-level data, which could reveal individual students’ identities. I’m particularly interested in what’s going on with those high downloads in the European Community data. I don’t think it’s just an outlier, as there’s a reasonable sample, so one or two high usage levels shouldn’t skew the overall picture. So is there a big difference between, say, Balkan states and older EU members?

I suspect that some librarians might be a bit uncomfortable with the idea of using such personal data about students. Ethnicity in particular can be a very sensitive issue, and we will have to tread carefully when we talk about differences in usage. But I think the point to remember is that all of this work is done with the aim of improving services to students. If we can use this data – on a completely anonymous basis (I have nothing to link the data points back to an individual student, as the student numbers were removed before the data was sent to me) – to understand what students are doing, and how it’s affecting their outcomes – and if we remain mindful of our need to protect student confidentiality – I think it can only be a good thing.

The next steps will be for us to test whether these differences are statistically significant – that is to say, whether they are a fluke that’s happened to occur in our particular sample, or whether we can say with statistical confidence that they exist in a wider general population as well. For that, we go beyond the familiar frustrations of Excel to the brave new world of… SPSS! Wish me luck…

Borrowing – year on year

Now that we’ve extracted the usage data for 2010/11, we can update some of Huddersfield’s usage graphs from the first phase of the project 🙂

First of all, an update of the graph that shows the average total number of items loaned to honours degree graduates for the last 6 years. As before, for each graduate, we’re looking at the total number of items they borrowed over 3 years (i.e. the graduation year, and the 2 previous years):


…and updates of the graphs that show borrowing by year of study:

borrowing in year 1 only

borrowing in year 1

borrowing in year 2 only

borrowing in year 2

borrowing in year 3 only

borrowing in year 3

LIDP on the road in 2012

Ellen and I will be presenting a poster at the LIBER 41st Annual Conference, 27 June – 30 June 2012, University of Tartu, Estonia. We look forward to sharing the early results of Phase II of the project with you and hearing your thoughts.

Stone, Graham, Collins, Ellen and Pattern, David (2012) Digging deeper into library data: Understanding how library usage and other factors affect student outcomes. In: LIBER 41st Annual Conference, 27 June – 30 June 2012, University of Tartu, Estonia.

Library Impact Data Toolkit

We are pleased to release the LIDP Toolkit.

One of the outcomes of the project was to provide a toolkit to assist other institutions who may want to test their own data against our hypothesis. The toolkit aims to give general guidelines about:

1. Data Requirements
2. Legal Issues
3. Analysis of the Data
4. Focus Groups
5. Suggestions for Further Analysis
6. Release of the Data

LIDP data has been made available under the Open Data Commons Attribution License.

If you have used this toolkit to look at your data, we ask you to share your data too. Please let us know and we will link to it from the project blog.

The Library Impact Data Project Toolkit

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Unported License