
Final blog post: managing the data

In terms of the data, this project was both simpler and more complex than the first stage. The fact that we were only dealing with data from a single institution made a big difference: we didn’t have to worry about ensuring that our classifications of data matched across different institutional systems and we only had one set of people to talk to when the data wasn’t quite what we were expecting!

On the other hand, the data itself was considerably more complicated. We had a larger number of variables to manage, and in many cases the data didn’t present itself in the ideal format for analysis. So a lot of the work on this round of the project was taken up with recoding data – and that involves not just hours of shouting at Excel but also a bit of brainpower to decide how to create groups and sub-groups within different variables.

Let’s take, as an example, the ‘ethnicity’ variable. The data provided by Huddersfield divided students into 20 different ethnicity categories, some of which had only a handful of students in them. This was problematic for two reasons – first, the students might potentially be identifiable in such small groups, and second, it was very unlikely that statistical tests would reveal any meaningful differences at that level of detail. By grouping the 20 categories into 5 or 6, we protected student identities and increased the viability of our statistical tests.
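In code, this kind of regrouping is really just a lookup table. Here's a minimal sketch in Python — the category names are hypothetical stand-ins, not the actual Huddersfield codes or our final groupings:

```python
# Hypothetical lookup table: the real Huddersfield categories (and the
# groupings we settled on) differed, but the mechanics are the same.
ETHNICITY_GROUPS = {
    "White - British": "White",
    "White - Irish": "White",
    "Black - Caribbean": "Black",
    "Black - African": "Black",
    "Asian - Indian": "Asian",
    "Asian - Pakistani": "Asian",
    "Asian - Bangladeshi": "Asian",
    "Chinese": "Chinese",
}

def regroup(category: str) -> str:
    """Map a detailed ethnicity code onto a broader analytic group.

    Anything we can't confidently place (e.g. 'Asian - Other') falls
    into a catch-all 'Other' group rather than being guessed at.
    """
    return ETHNICITY_GROUPS.get(category, "Other")

print(regroup("Black - African"))  # Black
print(regroup("Asian - Other"))    # falls through to Other
```

The catch-all default matters: it is what stops small, ambiguous categories from being silently forced into groups where they may not belong.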

Now, it’s important to stress that you must not simply regroup data in order to get a statistically significant result, trying lots of different groupings until you get one that gives you the numbers you want. We had to have some theoretical underpinning to our categorisation – and this is where the ‘brainpower’ I mentioned earlier comes in. We could have categorised everyone into ‘white’ and ‘non-white’ but this wouldn’t necessarily give us particularly useful information: it would not reveal any differences between non-white groups, which might be important when planning activities to support those various groups. So we were looking for a happy medium somewhere between 20 and 2 groups.

Luckily, this is a problem that other researchers have faced before, and we were able to borrow from some fairly standard categories (you can see them in more detail in this post). Even then, though, we had to scratch our heads a bit to map the Huddersfield categories onto the new ones. For example, we had several groups called things like ‘Black – Other’ and ‘Asian – Other’. Should we put these into one big ‘other’ category, or into ‘Black’ and ‘Asian’ respectively? In the end, we felt that we didn’t have enough information to put them into the more specific categories. Our ‘Asian’ category, for instance, referred specifically to the Indian subcontinent, and we grouped China separately. But we didn’t know whether our ‘Asian – Other’ students were of Indian or Chinese heritage, or whether they were of a heritage which didn’t fit into our existing groups – Korean or Thai, for example.

In some cases, we didn’t have standard categories to borrow from, and had to create our own. For example, our data held only the names of courses that students were following, not their discipline or school. So when trying to group by subject, we had to start from scratch, taking the 100-odd courses and grouping them to allow analysis. We did this with the help of the library staff: it’s fairly likely that if we’d used other Huddersfield staff members we would have ended up with slightly different classifications. In the same way, other institutions might have a different take on how best to organise such data, according to their own organisational set-ups.

All in all, arranging and managing the data was one of the more time-consuming elements of the project. This is something that is only likely to get trickier if the project were working with data from several institutions. This begins with coding the text-based data: for example, imagine that one institution records students as ‘Black – African’, another, as ‘African (Black)’ and another as ‘Black African’. It’s not too difficult to get Excel to recognise that these are all the same thing, but it would be very time consuming to do it for each response in a number of variables. And for some variables it might involve some intellectual effort to decide upon equivalence – discipline is a particularly good example of this. Each university will offer slightly different courses, and it will take work to map them onto each other.
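As a sketch of that first, mechanical step — recognising that differently punctuated labels are the same category — here's one hedged approach in Python. It assumes the variants differ only in punctuation and word order, which real institutional data may not always oblige:

```python
import re

def canonical(label: str) -> str:
    """Collapse punctuation and word-order variants of a label to one key.

    'Black - African', 'African (Black)' and 'Black African' all reduce
    to the same canonical form; genuinely different labels do not.
    """
    words = re.findall(r"[a-z]+", label.lower())
    return " ".join(sorted(words))

variants = ["Black - African", "African (Black)", "Black African"]
print({canonical(v) for v in variants})  # a single canonical key
```

This handles the easy, mechanical equivalences; the intellectual work of deciding that two *differently worded* categories (or two differently organised course lists) are equivalent still has to be done by hand.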

After the variables have been coded and brought into alignment, another challenge exists in ensuring that analysis works for all partners. Take country of domicile as an example. Some universities may have a particularly large number of students from, say, the Middle East, and want to separate that out as an analytic category. But, given that we need to keep the overall numbers of analytic categories low in order for the analysis to work and that different universities are likely to have different areas of geographic interest, how do we decide on the best groupings?

We’ll have to have a think about how best to address these challenges, if the research is to progress to include a wider number of institutions who want to work together and pool their data.

FINDINGS post 5: Predicting outcomes

One of the original aims for this project was to see whether library usage data could be combined with other variables to build a model that might help predict student outcomes. At first we thought this mightn’t be possible – student results weren’t normally distributed, and this precluded any regression analysis. But, once we weeded out the part-time, non-Huddersfield-based students, we found that the results were normal. Hurrah! This is always a happy, happy moment for statistics bods.

So, I set out with the data that I had to try and build a model. The aim, in statistics, is to try and build the best predictive model using the fewest input variables. This is for several reasons: it’s practical (saves you from having to collect loads of data which doesn’t add much to the final prediction), it’s elegant, and it reduces the likelihood of problems such as multicollinearity, where two or more of your variables are correlated and which stuffs up the predictive power of your model. It’s also important that you only use variables which you have a strong theoretical reason for including – randomly picking variables to see which one gives you the best result is called data mining, and it’s Not Allowed. For these reasons, I only used the library usage variables which we’ve already shown to be related to outcome: e-resource hours, PDF downloads, number of resources accessed 1, 5 and 25 times, and number of items borrowed.

Regression analysis in SPSS gives you about a million different outputs, but the key one, in terms of understanding the predictive ability of the model, is the adjusted R². This figure ranges from 0 (terrible) to 1 (perfect, and highly suspicious!). I was taught that in the social sciences an adjusted R² of around .7 is considered pretty good going. The one we achieved with this model is .106. Oh dear.

Furthermore, there do seem to be some potential problems with multicollinearity. This isn’t hugely surprising – after all, we ought to expect the three variables for number of e-resources accessed to be related, especially the last two – if you’ve accessed something 25 times you have also accessed it 5 times after all! It may also be that there’s a relationship between the hours spent logged into e-resources and the number of resources accessed.

So I next tried a more parsimonious model: one which eliminated the variables for number of e-resources accessed 5 and 25 times. The adjusted R² for this model is .107 – fractionally better – and the multicollinearity seems to be less of a problem. But it’s still not a great predictor. For the next step, I added in the demographic variables that we have, and got a model with an adjusted R² of .177. This pretty much exhausted all the data that I had available to me, and I have therefore declared the answer to our original question to be – no, you can’t build a model to predict outcomes with the data available to this project.

Now, that’s not to say that nobody could. There are a lot of variables we haven’t included. In some cases this is because we don’t have the data – for example, UCAS points might be a useful indicator of a student’s potential on arrival and a good thing to include, if you have them. In other cases, the data is simply too difficult to collect – how do you measure a teacher’s ability (as opposed to how that ability is perceived by their students!), or stressful life events that a student is dealing with at exam time?

It would be really interesting to see whether integrating the missing information such as UCAS points would improve the model – and whether we could try and find a way of measuring proxies for some of those other issues: contact with pastoral staff, student satisfaction surveys. Of course, this would raise a lot of ethical issues and couldn’t be done lightly. But it would be interesting…

FINDINGS post 4: Final results and usage

In this round of analysis, we had some new metrics of library usage which weren’t available to researchers in the first round of the project. I’ve already blogged about one of them – overnight usage – and there’s more on that later in this post, but first I’d like to talk about the other three. These are – the number of e-resources accessed (as distinct from the hours spent logged into e-resources); the number of e-resources accessed 5 or more times, and the number of e-resources accessed 25 or more times. These metrics show how many of Huddersfield’s 240-odd e-resources, which range from large journal platforms and databases down to individual journal subscriptions, a student has logged into during the year, once, at least five times or at least 25 times.
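These breadth metrics are straightforward to derive from raw log-in events. A hedged sketch — the actual Huddersfield extraction will have looked different, but the counting logic is this:

```python
from collections import Counter, defaultdict

def breadth_metrics(events):
    """events: iterable of (student_id, resource_id) log-in records.

    Returns, per student, the number of distinct e-resources accessed
    at least once, at least 5 times and at least 25 times in the year.
    """
    per_student = defaultdict(Counter)
    for student, resource in events:
        per_student[student][resource] += 1
    return {
        student: (
            len(counts),
            sum(1 for c in counts.values() if c >= 5),
            sum(1 for c in counts.values() if c >= 25),
        )
        for student, counts in per_student.items()
    }

# Toy example: one student hammers resource 'r1', dips into two others.
events = [("s1", "r1")] * 25 + [("s1", "r2")] * 5 + [("s1", "r3")]
print(breadth_metrics(events))  # {'s1': (3, 2, 1)}
```

Note that the three figures are nested by construction: anything accessed 25 times has also been accessed 5 times, which is why we'd expect the metrics to be related (and why multicollinearity turns up later, in the modelling post).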

You’ll already have seen these three dimensions in the posts on our demographic and subject-based analysis. But now I’m going to see whether there’s a relationship between use on these dimensions and final degree outcome, using the same methodology as Phase 1 of the project. Figure 1 shows the results.

Figure 1: Usage and final degree outcome

As you can see, there are quite a few statistically significant differences for the e-resource dimensions. Most of these are small effects, but the difference between a first and a third is medium sized for the number of e-resources accessed, and the number accessed five or more times. This is a really interesting finding. It suggests that breadth of reading – indicated by using a number of different e-resources – might be a particularly important factor in degree success, and leads to all kinds of questions about how the library might support students in reading widely.

You can see that we’ve found a difference for the percentage of overnight usage as well! Weirdly, this only pops up in relation to the difference between 2.i and 2.ii degrees, and it’s a minuscule effect. I’m inclined to dismiss this as a blip and go with our previous finding that there isn’t a significant difference between grades in terms of their overnight usage: with the same caveat that our model is different from the one used by Manchester Met, and thus perhaps not as able to identify nuanced differences.

FINDINGS post 3: Dropping out

I’ve already posted some early findings about usage and dropping out, showing that there is a relationship between not using the library and dropping out. As you might remember, these findings had a fairly big health warning attached to them.

Since that blog was posted, I think I’ve found another way to look at the relationship between non-use and dropping out, one which is a bit more nuanced. You might remember that I created a ‘binary’ dummy variable for use of the library, in order to get around the problem of cumulative usage (we’d expect a third year student to have higher usage than a first year student, simply because they’ve been at the university longer, but we didn’t have access to the year-of-study data that would allow us to correct for this). I mentioned at the time that turning a continuous variable into a categorical one isn’t an ideal solution, and it turns out that a better one was staring me in the face the whole time (this, I find, is a frequent joy of statistical analysis!).

In that very same blog, I said that we were only looking at the most recent year of data: this was to give first and third year students the same opportunity to move themselves from the ‘non-user’ to ‘user’ category. But (and I’m kicking myself now for not seeing this at the time) once you look only at the last year of data, you can use cumulative measures again! After all, if you only look at one year of data for every student, you eliminate the problem whereby third year students will have accumulated more use simply by virtue of the fact that they’ve been there longer. Those first two years cease to be relevant as they’re not counted within the measure! Of course, we still have the problem of different patterns of usage in years 1, 2 and 3, but I think we just have to shrug our shoulders and accept that.

So, I have run some new tests – Mann Whitney U again – using a cumulative measure of usage for the first two terms of the 2010-11 academic year. As you might remember, we’re only looking at people who dropped out in term three (again, this is to try and give the dropouts a decent amount of time to establish their usage patterns: including people who dropped out in their second week of term is going to skew our results as they may not even have had their library induction by that point!). This means that all the students included in this study were at the university in the first two terms, and they have all had exactly the same opportunity to accumulate usage.
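For the curious, the U statistic underlying these tests simply counts, across every pairing of a student from one group with a student from the other, how often the first out-ranks the second. The real analysis was run in SPSS; this bare-bones sketch (with made-up usage counts) shows just the statistic and skips the p-value machinery:

```python
def mann_whitney_u(a, b):
    """U statistic for samples a and b: the number of pairs (x, y),
    x drawn from a and y from b, with x > y, counting ties as half."""
    return sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x in a
        for y in b
    )

# Toy data: term 1+2 usage counts for current vs dropped-out students.
current = [14, 9, 22, 17]
dropouts = [3, 9, 5]
u = mann_whitney_u(current, dropouts)
print(u)  # 11.5 out of 12 possible pairs
```

Because it works on ranks rather than raw values, the test doesn't require the usage data to be normally distributed, which is why it suits skewed measures like these.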

You might remember that in the earlier analysis I mentioned that we couldn’t separate out the full-time and part-time students as the sample became too small. Well, the new test resolves that problem, at least in part. Figure 1 shows the effect sizes for the difference in usage between current and dropout students: in the first column, full-time and part-time students based on all sites (Huddersfield and its secondary campuses at Barnsley and Oldham); in the second column, full-time and part-time students based at Huddersfield only; and in the third column, full-time students based at Huddersfield.

Figure 1: Retention and usage

You can see that as we decrease the sample size by removing certain categories of student, some of the dimensions cease to be statistically significant. When we remove sites other than Huddersfield, library PC use stops being significant, and when we limit our analysis to full-time students, the number of library visits also stops being significant. This tallies with the findings from Phase 1 of the project, which found that items borrowed, e-resource hours and PDF downloads were the only dimensions to have a statistically significant relationship with final degree outcome.

The effect sizes here are all very small, but they are significant and they are there for the number of items borrowed, the hours logged into e-resources and the number of PDF downloads. This reinforces the findings from my earlier analysis: that non-usage of the library can’t be taken, on its own, as a predictor of dropping out, but that there is some kind of relationship which suggests that it could be a useful warning signal, taken in the context of other warning signals such as course tutor feedback, lack of attendance at lectures and so forth.

It might have been interesting to look at some of the other dimensions of use in relation to dropping out. Percentage of overnight use is one obvious one: we didn’t find a correlation between overnight use and poor grades, but perhaps there’s one between overnight use and dropping out? The number of e-resources accessed might also be interesting. We can’t do this with the Huddersfield data at the moment, because both of those dimensions contain information about the full year and it’s not possible to extract terms one and two for analysis. Maybe that’s one for Phase 3…?

FINDINGS post 2: discipline matters

We’re moving on in this second of our final blogs to look at the relationship between discipline and usage. There are some interesting – although not necessarily unexpected – findings here, and bigger effect sizes than we saw with the demographic variables. Again, I’ve only shown the differences that are statistically significant: effect sizes are highlighted by the depth of colour, with small effects pale, medium effects a little darker, and large effects very dark.

You might remember that we had to aggregate some of the demographic categories – such as ethnicity and country of origin – both to protect student confidentiality and to get meaningful results. The same is true of our discipline-based analysis. But, because we suspected that this was going to be quite an important area, we’ve decided to go a bit more granular. So, we’ve created two levels of analysis. Each of the 100-odd courses offered by Huddersfield to its full-time, undergraduate students has been classified into one of 17 ‘clusters’. These clusters have then been aggregated to form six ‘groups’. We can compare the groups to see some overarching differences, and then drill down to compare clusters within groups, to get a more detailed understanding.
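Structurally, the two levels are just nested lookups. A sketch with hypothetical course and cluster names (the real lists ran to 100-odd courses and 17 clusters):

```python
# Hypothetical names throughout: illustrative of the structure only.
COURSE_TO_CLUSTER = {
    "BSc Adult Nursing": "Nursing",
    "BSc Podiatry": "Other health",
    "BA English Literature": "English",
    "BEng Mechanical Engineering": "Engineering",
}

CLUSTER_TO_GROUP = {
    "Nursing": "Health",
    "Other health": "Health",
    "English": "Humanities",
    "Engineering": "Computing and engineering",
}

def classify(course: str) -> tuple[str, str]:
    """Return (cluster, group) for a course title."""
    cluster = COURSE_TO_CLUSTER[course]
    return cluster, CLUSTER_TO_GROUP[cluster]

print(classify("BSc Adult Nursing"))  # ('Nursing', 'Health')
```

The nice property of nesting is that the group level comes for free once the cluster level is agreed: compare groups for the overview, then drill into clusters within a group for the detail.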

It’s important to mention that the grouping work was done by librarians and student support staff, so it represents the relationships that they see between the different courses. I suspect that if we’d asked course tutors, lecturers or students themselves we might have seen slightly different combinations. Also, the grouping was slightly driven by numbers: we had to make sure that there were enough people in each category to make the statistical tests viable and to ensure that anonymity was protected.

Limitations duly mentioned, let’s go on to look at the findings. First, the aggregated subject groups. We used the social science group as our control, as it was the biggest. As you can see from Figure 1, it is also the higher user in most comparisons where significant differences exist: the only exceptions are the comparisons with the number of items borrowed and the number of e-resources accessed by health students. We think this might be because health students are often out on placements, limiting their opportunities to visit the library but making e-resources more important. Furthermore, e-resource use is heavily embedded into the nursing curriculum, and most students will have classes which require them to go into the library’s e-resources and search for items.

Figure 1: Aggregated subject groups

The overall takeaway from this figure, I think, is that computing and engineering students are less intensive users on a number of dimensions, with a medium effect for the number of items borrowed. Arts students are very low users compared to the social sciences, with medium effects on several dimensions.

Let’s move on now to look at how the smaller clusters relate to each other within the groups. Poor old science was in a group all by itself: it wasn’t possible to sub-divide this one so we’ll skip over it and look at health. This has been divided into nursing and other health disciplines (including subjects such as sports therapy, physiotherapy, podiatry and occupational therapy) and you can see from Figure 2 that nursing is a bigger user on pretty much every dimension. There’s a large effect for the number of e-resources accessed, suggesting that nursing students are reading more widely than their counterparts in other health disciplines. This might be because nurses are required to use a certain number of documents for some of their assignments – ten pieces of a specific kind of research for systematic reviews, or four journal articles for an early assignment; the wide use might represent their efforts to find exactly the right kind of resource. There are also medium-sized effects for library visits, hours logged into e-resources, number of e-resources accessed 5 or more times and number of PDF downloads and small effects for hours logged into library PCs and number of e-resources accessed 25 or more times; in each case, nurses are the higher users.

Figure 2: Health group

The computing and engineering group has been divided into – errr – computing and engineering! The differences here are fewer and smaller. Perhaps unsurprisingly, engineers use the library PCs more often – presumably the computing students are happily tapping away on their personal laptops. And computing students are downloading more PDFs. But on the whole, the behaviour of these two clusters is quite similar.

Figure 3: Computing and engineering group


Now we move on to the humanities group, which has been subdivided into three clusters: English; drama; and media and journalism. The first thing to note is that there are no statistically significant differences between English and drama students. But there are differences between these two clusters and the media students, and where those differences exist the media students have the lower level of usage. Most of these differences relate to e-resource use: the English students have higher use on pretty much all the e-resource dimensions, with medium-sized effects. The differences between media and drama are smaller.

Figure 4: Humanities group

Now, on to Figure 5 and the social science group, which looks colourful and complicated! The first thing to note is the number of cells which are shaded in dark colours: there are a lot of big effects within this group. Overall, behavioural sciences dominate. They have higher usage, at a statistically significant level, than every other cluster on at least one dimension. Business is the next-most-dominant discipline, although it’s worth noting that business students borrow fewer items than their colleagues in every other discipline except law, and that the effect sizes here are medium or large. Lawyers, in fact, have the lowest use compared to most subjects, but they do use the library, and especially its computers, more than their counterparts in social work and education. Finally, there are no significant differences between social work and education: perhaps it’s unsurprising that these two vocational courses have similar patterns of usage.

Figure 5: Social science group

Finally, we turn to the arts group. This one might have been skewed a bit by the inclusion of music, which contains a couple of courses that might have fitted alongside English and drama in the humanities group as well as more technology-focused subjects that fit with the design courses. Musicians are heavier users than every other cluster on at least four dimensions. They borrow more items than all of the other clusters, in each case with a large effect. They are also higher users of electronic resources, particularly when compared to the 2D and 3D designers. Elsewhere, there are fewer significant differences. Architects do not visit the library very often compared to the other clusters in this group, but fashion designers do. It’s possible that the architects’ relatively low level of use is because they have a ‘Design Centre’ in the department, which offers access to computers, journals and other materials; the ‘Art and Design Resource area’ in the library, meanwhile, has traditionally been more focused on textiles and fashion design, which might explain the fashion designers’ higher number of library visits. And, as a final footnote, 3D designers are fond of e-resources.

Figure 6: Arts group

The main finding from this section of the analysis – that discipline has a big effect on patterns of library usage – might not be earth shattering. But it does provide statistical backing for something that many librarians will already know anecdotally or from their own observations. This could be a really useful starting point for conversations with academics – checking whether the low usage by their students is a cause for concern and, if so, what might be done to increase it.

FINDINGS post 1: demographic differences

And so, our exciting rollercoaster of findings gets underway with a quick look at some of the demographic factors which seem to affect usage of library resources. Now, I’ve already posted about some early findings, which hadn’t been tested for statistical significance. These new findings have, and I’m only showing the results which are significant: i.e. where we can be confident at the agreed level (which varies from test to test but is in every case a standard statistical method) that the results represent real differences within a wider population, and aren’t just a coincidence within the sample of data that we’ve got.

We’re looking here at final year students who have finished their degrees, were full-time students and whose courses were based at the Huddersfield campus. This helps us to exclude a few variables that might possibly confound our overall findings: for example, we might find that mature students have less library usage, but if lots of our mature students are part-time students (and we haven’t tested for this, so I don’t know – it’s just an example) then we wouldn’t be able to tell whether it’s their maturity or their part-time status that limits their use of the library. Now, one thing we haven’t been able to do is to control for the different variables that we want to test – so we don’t necessarily know whether, say, our Asian students are disproportionately male, and it’s their gender rather than their race that makes them use the library more often (again, just an example – don’t quote me on this). This is a bit of a problem, but it’s not unusual in statistics and with the sample that we’ve got there’s no way round it other than to shrug our shoulders and make sure we acknowledge this when we report the findings. (Hence this paragraph of caveat!)

For each finding, I’m showing the effect size. For the tests that we’ve used (Mann-Whitney U, fact fans), these are generally reckoned as follows: anything up to .3 is a small effect, between .3 and .5 is a medium sized effect, and anything over .5 is a large effect. (Ignore the minus signs by the way – they’re just a function of the test and don’t mean anything.) You’ll notice that most of the demographic variables only show small effect sizes but don’t worry – it gets a lot more exciting when we look at subjects.
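A common way of deriving an effect size from a Mann-Whitney result — and, I'd assume, the one behind these figures, though the formula isn't spelled out in the post — is r = Z / √N, which is also why the sign is safely ignorable:

```python
import math

def effect_size_r(z: float, n: int) -> float:
    """Common effect-size conversion for Mann-Whitney: r = |Z| / sqrt(N).

    Taking the absolute value is why the minus signs in the figures can
    be ignored: the sign only reflects which group was listed first.
    """
    return abs(z) / math.sqrt(n)

def describe(r: float) -> str:
    """The banding used in these posts: <.3 small, .3-.5 medium, >.5 large."""
    if r < 0.3:
        return "small"
    if r < 0.5:
        return "medium"
    return "large"

print(describe(effect_size_r(-2.5, 400)))  # 0.125 -> 'small'
```

One consequence worth noticing: the same Z translates into a smaller r on a bigger sample, so large samples can produce statistically significant results with tiny effect sizes, which is exactly the pattern in several of the figures that follow.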

I think most of the dimensions of use are fairly self-explanatory, but I should probably clarify what we mean by the three that refer to the ‘number of e-resources accessed’. This metric shows how many of Huddersfield’s 240-odd e-resources, which range from large journal platforms and databases down to individual journal subscriptions, a student has logged into during the year, once, at least five times or at least 25 times.

So, without any further ado, let’s crack on!

Figure 1: Age and usage

Figure 1 shows the relationship between age and usage. We’ve separated our students into mature – those who entered the university aged 21 or older – and non-mature students (it was VERY difficult to come up with a non-pejorative name for that group!). As you can see, mature students have higher library usage than non-mature students on every dimension except library visits and hours spent logged into the library PCs. Remember, these are all full-time students, so it’s not that they’re illicitly logging in from work rather than visiting the physical library. We wondered whether the mature students can afford their own laptops and therefore have less need of the library PCs: it might also have something to do with the way younger students treat the library as more of a social space to hang out with friends. Most departments also have resource centres where students can access computers, which may seem like a less daunting study environment for some students.

Figure 2: Gender and usage

Figure 2 looks at gender and usage. Again, we see differences here on almost every metric – hours logged into the library PCs and number of e-resources accessed 25 or more times are the only exceptions. And on almost every one, women are bigger users than men – the only exception is the number of visits to the library, where men dominate.

Now, we’re looking at something a bit different for the next few figures. Rather than comparing each group to every other, we’ve chosen one group as a control and compared all the rest to them (this is for reasons to do with our statistical methods). In each case, our control is simply the biggest group: this allows us to compare minorities with the majority and hopefully identify behaviour that might otherwise get lost.

Figure 3: Ethnicity and usage

Figure 3 looks at ethnicity and usage; the control group here is ‘white’. (For more information on how we constructed our ethnicity categories, go back to my earlier post.) You’ll notice that there are fewer significant differences here, and almost none to do with e-resources. The only exception is the number of e-resources accessed by Chinese students – we’ll come back to that. Asian students are big users of the library, and especially the library PCs, but this high level of use doesn’t translate into borrowing more items. Facebook, anyone?! Black students are also big library users, and they are borrowing more than their white counterparts. Chinese students, as we’ve said, are borrowing less and accessing fewer e-resources than white students: there may be an issue here to do with breadth of reading which (as we’ll see in a few posts’ time) is important.

Figure 4: Country of domicile and usage

Figure 4 looks at country of domicile (Huddersfield-ese for ‘where do you live when you’re not at university?’), with the control group being students based in the UK. There are more significant differences here on several dimensions of use. Notably, students from the new EU (member states which joined in 2004 or later) are very keen on computers: they spend more time logged into e-resources, download more content and use more resources more often than UK-based students. Students from the old EU (pre-2004 member states) and the very broad ‘rest of the world’ category visit the library less than UK students, but borrow more items and use the PCs more often when they are there. Students hailing originally from China show lower usage on a number of dimensions: alongside Figure 3, this suggests that Chinese students are systematically lower users of library resources.

All this is very interesting, but how useful is it in helping librarians to develop services that meet the needs of their users? In truth, it’s really only a first step. We now know where the differences in usage are, but we don’t know why they exist, and that’s what we need to understand if we are to tailor services to students. Maybe the Chinese students are getting all their information from alternative sources and so it doesn’t matter that their usage is much lower. Perhaps the high use of e-resources by students from the new EU doesn’t indicate thorough, broad reading but rather very inefficient search and discovery strategies. In order to really understand what’s going on behind the numbers, we will need to take a more qualitative approach, running focus groups and case studies to explore students’ behaviours and the reasons for those behaviours. Only then can we attempt to understand what we should do to ensure the library’s doing everything it can to support all its users.

The Final Blog Post

It has been a short but extremely productive 6 months for the Library Impact Data Project Team. Before we report on what we have done and look to the future, we have to say a huge thank you to our partners. We thought we would be taking a lot on in getting eight universities to partner in a six-month project; however, it has all gone extremely smoothly and, as always, everyone has put in far more effort and work than originally agreed. So thanks go to all the partners, in particular:

Phil Adams, Leo Appleton, Iain Baird, Polly Dawes, Regina Ferguson, Pia Krogh, Marie Letzgus, Dominic Marsh, Habby Matharoo, Kate Newell, Sarah Robbins, Paul Stainthorp

Also to Dave Pattern and Bryony Ramsden at Huddersfield.

So did we do what we said we would do?

Is there a statistically significant correlation across a number of universities between library activity data and student attainment?

The answer is a YES!

There is a statistically significant relationship between student attainment and both book loans and e-resource use. This is true across all of the universities in the study that provided data in these areas. In some cases the relationship was stronger than in others, but our statistical testing shows that you can believe what you see when you look at our graphs and charts!

Where we didn’t find statistical significance was in entries to the library: although it looks as if there is a difference between students with a 1st and a 3rd, there is no overall significance. This is not surprising, as many of us have group study facilities, lecture theatres, cafes and student services in the library. A student is therefore just as likely to be entering the library for those reasons as for studying purposes.

We want to stress here again that we realise THIS IS NOT A CAUSAL RELATIONSHIP!  Other factors make a difference to student achievement, and there are always exceptions to the rule, but we have been able to link use of library resources to academic achievement.

So what is our output?

Firstly, we have provided all the partners in the project with short library director reports and are in the process of sending out longer, in-depth reports. Regrettably, due to the nature of the content of these reports, we cannot share this data; however, we are in the process of anonymising partners’ graphs in order to release charts of averaged results for general consumption.

We are also planning to release the raw data from each partner for others to examine. The data will be released under an Open Data licence at https://library.hud.ac.uk/blogs/projects/lidp/open-data/

Finally, we have been astonished by how much interest there has been in our project. To date we have two articles ready for imminent publication and another two in the pipeline. In addition, by the end of October we will have delivered 11 conference papers on the project. All articles and conference presentations are accessible at: https://library.hud.ac.uk/blogs/projects/lidp/articles-and-conference-papers/

Next steps

Although this project has had a finite goal in proving or disproving the hypothesis, we would now like to go back to the original project which provided the inspiration. This was to seek to engage low/non users of library resources and to raise student achievement by increasing the use of library resources.
This has certainly been a popular theme in questions at the SCONUL and LIBER conferences, so we feel there is a lot of interest in this in the library community. Some of these ideas were also discussed at the recent Business Librarians Association Conference.

There are a number of ways of doing this, some based on business intelligence and others based on targeting staffing resources. However, we firmly believe that although there is a business intelligence string to what we would like to take forward, the real benefits will be achieved by actively engaging with the students to improve their experience. We think this could be covered in a number of ways.

  • Gender and socio-economic background. These came up in questions from library directors at SCONUL and LIBER. We need to revisit the data to see whether there are any effects of gender, nationality (UK, other European and international could certainly be investigated) and socio-economic background on use and attainment.
  • We need to look into what types of data are needed by library directors, e.g. for the scenario ‘if budget cuts result in fewer resources, does attainment fall?’. The Balanced Scorecard approach could perhaps be used for this.
  • We are keen to see whether we add value as a library through better use of resources, and we have thought of a number of possible scenarios we would like to investigate further:
    • Does a student who comes in with high grades leave with high grades? If so why? What do they use that makes them so successful?
    • What if a student comes in with lower grades but achieves a higher grade on graduation after using library resources? What did they do to show this improvement?
    • Quite often, students who look to be heading for a 2nd drop to a 3rd in the final part of their course; why is this so?
    • What about high achievers that don’t use our resources? What are they doing in order to be successful and should we be adopting what they do in our resources/literacy skills sessions?
  • We have not investigated VLE use, and it would be interesting to see if this had an effect.
  • We have set up meetings with the University of Wollongong (Australia) and Mary Ellen Davis (executive director of ACRL) to discuss the project further. In addition we have had interest from the Netherlands and Denmark for future work surrounding the improvement of student attainment through increased use of resources

With respect to targeting non/low users, we would like to achieve the following:

  • Find out what students on selected non/low-use courses think, to understand why they do not engage
  • To check the amount and type of contact subject teams have had with the specific courses, to compare library hours to attainment (poor attainment does not necessarily reflect negatively on library support!)
  • Use data already available to see if there is correlation across all years of the courses. We have some interesting data on course year: some courses show no correlation between year-one use and final grade, but others do. By delving deeper into this we could target our staffing resources more effectively to help students at the point of demand.
    • To target staffing resources
  • Begin profiling by looking at reading lists
    • To target resource allocation
    • Does use of resources + wider reading lead to better attainment – indeed, is this what high achievers actually do?
  • To flesh out themes from the focus groups to identify areas for improvement
    • To target promotion
    • Tutor awareness
    • Inductions etc.
  • Look for a connection between selected courses and internal survey results/NSS results
  • Create a baseline questionnaire or exercise for new students to establish level of info literacy skills
    • Net Generation students tend to overestimate their own skills and then demonstrate poor critical analysis once they get onto resources.
    • Use to inform use of web 2.0 technologies on different cohorts, e.g. health vs. computing
  • Set up new longitudinal focus groups or re-interview groups from last year to check progress of project
  • Use data collected to make informed decisions on stock relocation and use of space
  • Refine data collected and impact of targeted help
  • Use this information to create a toolkit which will offer best practice to a given profile
    • E.g. scenario based

Ultimately, our goal will be to help increase student engagement with the library and its resources, which, as we can now prove, leads to better attainment. This work would also have an impact on library resources by helping to target our precious staff resources in the right place at the right time, and to make sure that we are spending limited funds on the resources most needed to help improve student attainment.

How can others benefit?

There has been a lot of interest from other universities throughout the project. Some universities may want to take our research as proof in itself and just look at their own data; we have provided instructions on how to do this at https://library.hud.ac.uk/blogs/files/lidp/Documentation/DataRequirements.pdf. We will also make available the recipes written with the Synthesis project in the documentation area of the blog, and we will be adding specific recipes for different library management systems in the coming weeks: https://library.hud.ac.uk/blogs/projects/lidp/documentation/

For those libraries that want to do their own statistical analysis, this was a complex issue for the project, particularly given the nature of the data we could obtain versus the nature of the data required to find correlations. As a result, we used the Kruskal-Wallis (KW) test, designed to measure whether there are differences between groups of non-normally distributed data. To confirm non-normal distribution, a Kolmogorov-Smirnov test was run first.

KW unfortunately does not tell us where the differences lie, so the Mann-Whitney test was used on specific pairings of degree results, selected on the basis of the visual data represented in boxplot graphs. The number of Mann-Whitney tests has to be limited, because the more tests conducted, the stricter the significance level required; we therefore limited them to three, at a required significance level of 0.0167 (5% divided by 3). Once the Mann-Whitney tests had been conducted, the effect size of each difference was calculated. All tests other than effect size were run in PASW 18; effect size was calculated manually.

It should be noted that we are aware the size of the samples we are dealing with could have indicated relationships where none exist, but we feel our visual data demonstrates relationships that are confirmed by the analysis, and thus that we have a stable conclusion in rejecting the null hypothesis that there is no relationship between library use and degree result.
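To make the follow-up step concrete, here is a short Python sketch of the Mann-Whitney test via the normal approximation, together with the 0.05/3 ≈ 0.0167 threshold and the effect size r = |z|/√n described above. The e-resource figures are invented for illustration, and this is a sketch of the procedure rather than the project’s actual implementation (which ran in PASW 18).

```python
import math

def mid_ranks(values):
    """Assign 1-based mid-ranks, averaging over tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j + 2) / 2.0
        i = j + 1
    return ranks

def mann_whitney(a, b):
    """Mann-Whitney U with the normal approximation.
    Returns (z, r) where r = |z| / sqrt(n) is the effect size."""
    n1, n2 = len(a), len(b)
    ranks = mid_ranks(list(a) + list(b))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2.0   # U statistic for group a
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    return z, abs(z) / math.sqrt(n1 + n2)

# Hypothetical e-resource log-ins for one pairing of degree results
firsts = [120, 95, 140, 110, 130, 105]
thirds = [40, 55, 30, 60, 45, 50]
z, effect = mann_whitney(firsts, thirds)

# With three pairwise tests, the required two-tailed significance level
# drops to 0.05 / 3 ~ 0.0167, i.e. |z| must exceed roughly 2.39.
print(abs(z) > 2.39, round(effect, 2))
```

An effect size around 0.8 in this invented example would count as large by Cohen’s usual rule of thumb (0.1 small, 0.3 medium, 0.5 large).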

Full instructions on how the tests were run will first be made available to partner institutions and then disseminated publicly through a toolkit in July/August.

Lessons we learned during the project

The three major lessons learned were:

Forward planning for the retention of data. Make sure all your internal systems and people are communicating with each other. Do not delete data without first checking whether other parts of the university require it; often deletion appears to be based on arbitrary decisions rather than institutional policy. You can only work with what you’re able to get!

Beware e-resources data. We always made it clear that the data we were collecting for e-resource use was questionable, and during the project we found that much of this data is not collected in the same way across a single institution, let alone eight! Athens, Shibboleth and EZProxy data may all be handled differently, and some may not be collected at all. If others find no significance between e-resources data and attainment, they should dig deeper into their data before accepting the outcome.

Legal issues. For more details on this lesson, see our earlier blog post on the legal stuff.

Final thoughts

Although this post is labelled the final blog post, we will be back!

We are adding open data in the next few weeks and during August we will be blogging about the themes that have been brought out in the focus groups.

The intention is then to use this blog to talk about specific issues we come across with data etc. as we carry our findings forward. At our recent final project meeting, it was agreed that all 8 partners would continue to do this via the blog.

Finally a huge thank you to Andy McGregor for his support as Programme Manager and to the JISC for funding us.