JISC MOSAIC Project
Background
The project follows on from the JISC-funded TILE Project and,
amongst other things, will be investigating the benefits of mining usage data from multiple university libraries.
Usage Licence
The Perl scripts linked to from this page are released under a CC0
licence and are provided "as is", with no strings attached.
Data Files
In order to generate the XML version of the circulation data, suitable for submitting to the project, you will
need to generate four separate text files (plus an optional fifth file that lists exclusions). In each case,
the format of the data in the files is TSV
(i.e. fields separated by a tab character, and a newline separating each row of data).
As the contributing libraries may be submitting multiple years' worth of data, it is possible to prepare separate
user, transaction and exclusion data per academic year (denoted by <year> in the file name).
In the examples below, the → character represents a tab character. A * means that the field is required.
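The tab-separated layout described above can be read with a few lines of code. The following is a minimal Python sketch, not part of the MOSAIC Perl scripts; the function name read_tsv and its required parameter are made up for illustration, and the check simply mirrors the * marker used in the field lists.

```python
def read_tsv(text, required):
    """Split tab-separated text into rows and check the required fields.

    `required` holds the zero-based positions of the fields marked with *.
    """
    rows = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = [field.strip() for field in line.split("\t")]
        for pos in required:
            if pos >= len(fields) or fields[pos] == "":
                raise ValueError(f"line {line_no}: required field {pos + 1} is empty")
        rows.append(fields)
    return rows


# The three sample rows from the user file, with real tab characters.
sample = "67890\tABC123\tUG3\n45678\tDEC987\tPhD2\n76543\t\tstaff\n"
users = read_tsv(sample, required=[0, 2])
print(users[2])  # ['76543', '', 'staff']
```

An optional field that is left blank still needs its tab character, so that the fields after it stay in the right position.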
user file: users.<year>.txt
FIELDS:
* user ID
course ID
* progression level
SAMPLE:
67890 → ABC123 → UG3
45678 → DEC987 → PhD2
76543 → → staff
The user ID is whatever ID you want to use to identify an individual library user. It will be converted to an
MD5 hash value before the data is submitted to MOSAIC. It must match the user ID contained in the transaction file.
The course ID is whatever ID or code you use to identify a course that a student studies on. It must match the
course ID in the course file. For library users who are not on a course (e.g. staff), the value can
be blank.
The progression level value is taken from the pre-defined list in the MOSAIC documentation.
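To illustrate why the user ID must be identical in the user and transaction files, here is a short Python sketch of MD5 hashing. It is an assumption-laden illustration rather than the code used by data2xml.pl: the function name hash_user_id is hypothetical, and the page does not say how the script encodes the ID or whether it adds anything to it before hashing.

```python
import hashlib


def hash_user_id(user_id):
    """Return the 32-character hexadecimal MD5 digest of a user ID.

    Assumption: the ID is hashed as UTF-8 text with surrounding
    whitespace removed; the real script may do this differently.
    """
    return hashlib.md5(user_id.strip().encode("utf-8")).hexdigest()


# The same ID appears in a user row and in a transaction row.
user_row = ["67890", "ABC123", "UG3"]
transaction_row = ["1222646400", "114784", "67890"]

hashed_user = hash_user_id(user_row[0])
hashed_borrower = hash_user_id(transaction_row[2])
print(len(hashed_user))                # 32
print(hashed_user == hashed_borrower)  # True
```

Because the hash depends on every character, an ID written as 67890 in one file and 067890 in the other would no longer link up after hashing.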
transaction file: transactions.<year>.txt
FIELDS:
* timestamp
* item ID
* user ID
SAMPLE:
1222646400 → 114784 → 67890
1225756800 → 103828 → 67890
1225756800 → 62580 → 76543
The timestamp is in Unix time format (i.e. the number
of seconds since 1st Jan 1970 UTC). It is used to calculate the day the transaction occurred on.
The user ID is whatever ID you want to use to identify an individual library user. It will be converted to an
MD5 hash value before the data is submitted to MOSAIC. It must match the user ID contained in the user file.
The item ID is whatever ID you want to use to identify a library book. It must match the item ID
contained in the item file.
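As a quick check on the timestamp field, the sketch below converts a Unix timestamp to a calendar day in Python. The function name transaction_day is hypothetical, and the use of UTC for the day boundary is an assumption; the page does not say which time zone the script applies.

```python
from datetime import datetime, timezone


def transaction_day(timestamp):
    """Convert a Unix timestamp (seconds since 1 Jan 1970 UTC) to a date.

    Assumption: the day is taken in UTC rather than local time.
    """
    return datetime.fromtimestamp(int(timestamp), tz=timezone.utc).date()


# Timestamps taken from the sample transaction rows.
print(transaction_day("1222646400"))  # 2008-09-29
print(transaction_day("1225756800"))  # 2008-11-04
```

If your library system exports dates rather than Unix timestamps, the conversion needs to run in the other direction before the transaction file is written.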
item file: items.txt
FIELDS:
* item ID
* ISBN(s)
* title
author(s)
publisher
publication year
persistent URL
SAMPLE:
123 → 0415972531 → Music & copyright → L. Marshall → Wiley → 2004 → http://libcat.hud.ac.uk/123
234 → 0415969298 → Songwriting tips → N. Skilbeck → Phaidon → 1997 → http://libcat.hud.ac.uk/234
The item ID is whatever ID you want to use to identify a library book. It must match the item ID
contained in the transaction file.
The ISBN(s) are one (or more) ISBNs, separated by a | pipe character where more than one ISBN is linked to the item
(e.g. 0415966744|0415966752).
The title is the title of the book.
The author(s) are one (or more) names, separated by a | pipe character where more than one name
is present (e.g. John Smith|Julie Johnson).
The publisher and publication year are the name of the publishing company and the year of publication.
The persistent URL is the web address the item can be found at (e.g. on your library catalogue).
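The item fields described above, including the pipe-separated ISBN and author lists, could be unpacked along these lines. This is a hedged Python sketch: parse_item and the dictionary keys are names chosen for illustration, and the sample row combines values from the examples on this page.

```python
def parse_item(line):
    """Split one row of the item file into named fields.

    ISBNs and authors become lists, because each may hold several
    values separated by a | pipe character.
    """
    fields = [field.strip() for field in line.rstrip("\n").split("\t")]
    fields += [""] * (7 - len(fields))  # pad any missing trailing optional fields
    item_id, isbns, title, authors, publisher, year, url = fields[:7]
    return {
        "item_id": item_id,
        "isbns": [isbn for isbn in isbns.split("|") if isbn],
        "title": title,
        "authors": [name for name in authors.split("|") if name],
        "publisher": publisher,
        "year": year,
        "url": url,
    }


row = ("123\t0415966744|0415966752\tMusic & copyright\t"
       "John Smith|Julie Johnson\tWiley\t2004\thttp://libcat.hud.ac.uk/123")
item = parse_item(row)
print(item["isbns"])    # ['0415966744', '0415966752']
print(item["authors"])  # ['John Smith', 'Julie Johnson']
```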
course file: courses.txt
FIELDS:
* course ID
* course title
course code(s)
SAMPLE:
AE110 → BA(H) English & Media → QP33
AE120 → BA(H) Drama & English → W440|W4W3|WP43|WQ43
AE200 → BA(H) English Language PT →
The course ID is whatever ID or code you use to identify a course that a student studies on. It must match the
course ID in the user file.
The course title is the human readable name of the course.
The course code(s) is a list of zero (or more) UCAS or JACS course codes. A | pipe character
should be used to separate multiple values.
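Since course IDs must agree between the user and course files, a small consistency check before generating the XML may save a failed run. The Python sketch below is an illustration under stated assumptions: split_codes and unknown_course_ids are hypothetical helpers, the rows are assumed to be already split on tabs, and the user IDs and the course ID ZZ999 are invented for the example.

```python
def split_codes(field):
    """Split a pipe-separated course code field; a blank field gives []."""
    return [code for code in field.split("|") if code]


def unknown_course_ids(user_rows, course_rows):
    """List course IDs used in the user file but absent from the course file.

    Blank course IDs (e.g. for staff) are allowed and therefore skipped.
    """
    known = {row[0] for row in course_rows}
    return sorted({row[1] for row in user_rows if row[1] and row[1] not in known})


course_rows = [
    ["AE110", "BA(H) English & Media", "QP33"],
    ["AE120", "BA(H) Drama & English", "W440|W4W3|WP43|WQ43"],
    ["AE200", "BA(H) English Language PT", ""],
]
user_rows = [
    ["10001", "AE110", "UG3"],
    ["10002", "ZZ999", "PhD2"],  # invented course ID, not in the course file
    ["10003", "", "staff"],
]
print(split_codes(course_rows[1][2]))              # ['W440', 'W4W3', 'WP43', 'WQ43']
print(unknown_course_ids(user_rows, course_rows))  # ['ZZ999']
```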
optional exclusion file: exclude<year>.txt
FIELDS:
* match point
* value
SAMPLE:
course → AE100
user → 12345
item → 67890
prog → PhD3+
coprog → AE100|PhD3+
Although you can exclude data from your MOSAIC submission simply by leaving it out of the
files above, you can also list specific values to be excluded as the files are parsed.
The match point can be one of five values:
course = a specific course ID
user = a specific user ID
item = a specific item ID
prog = a specific progression level
coprog = a specific course and progression level combination
For example, if you have a borrower account for handling inter-library loans, you may want to exclude it from the
data submitted to MOSAIC. Alternatively, if a certain course only has a single student on it, you may wish to
exclude that course to ensure that the borrowing habits of that individual are not exposed.
The value is the relevant value to match. For coprog values, specify the course ID and
progression level with a | pipe character in between.
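The five match points can be pictured as five lookup sets. The Python sketch below shows one possible way to load and apply an exclusion file; it is not the logic of data2xml.pl. The names load_exclusions and is_excluded are hypothetical, and values are assumed to be compared as exact strings (so PhD3+ is treated as a literal progression level, not as a pattern).

```python
MATCH_POINTS = ("course", "user", "item", "prog", "coprog")


def load_exclusions(text):
    """Read exclusion rows (match point, tab, value) into one set per match point."""
    rules = {name: set() for name in MATCH_POINTS}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        match_point, value = (part.strip() for part in line.split("\t", 1))
        if match_point not in rules:
            raise ValueError(f"unknown match point: {match_point}")
        if match_point == "coprog":
            # Store the course ID and progression level as a pair.
            course_id, prog = value.split("|", 1)
            rules["coprog"].add((course_id, prog))
        else:
            rules[match_point].add(value)
    return rules


def is_excluded(rules, user_id, course_id, prog, item_id=None):
    """Return True if any exclusion rule matches the given values."""
    return (
        user_id in rules["user"]
        or course_id in rules["course"]
        or prog in rules["prog"]
        or (course_id, prog) in rules["coprog"]
        or (item_id is not None and item_id in rules["item"])
    )


rules = load_exclusions("user\t12345\nitem\t67890\ncoprog\tAE100|PhD3+\n")
print(is_excluded(rules, "12345", "AE110", "UG3"))    # True: user is listed
print(is_excluded(rules, "55555", "AE100", "PhD3+"))  # True: course and level pair is listed
print(is_excluded(rules, "55555", "AE100", "UG3"))    # False: only the pair is excluded
```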
Perl Scripts
These scripts are released under a CC0 licence and are provided "as is", with no strings attached.
data2xml.pl
Notes:
- Various options can be configured in the MAIN VARIABLES section, although some of these can be overridden on the command line, e.g.:
- perl data2xml.pl 2007 -- process the 2007 data (e.g. users.2007.txt, transactions.2007.txt etc)
- perl data2xml.pl 2007 2 -- process the 2007 data and generate a level 2 XML file
- perl data2xml.pl 2007 2 2000 -- process the 2007 data and generate level 2 XML files with up to 2,000 records in each file
- The XML filename is based on the options -- e.g. mosaic.2005.level1.1244486113.0000001.xml is level 1 data from 2005.
- You can choose to generate a debug file which will list details of transactions that have been ignored or excluded.
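If you need to sort or check the generated files, the year and level can be read back out of the filename. The following Python sketch is an illustration only: parse_output_name is a hypothetical helper, the pattern is inferred from the single example filename in the notes above, and the two remaining numeric parts are returned as-is because this page does not explain them.

```python
import re

# Pattern inferred from the example name mosaic.2005.level1.1244486113.0000001.xml
FILENAME_PATTERN = re.compile(r"^mosaic\.(\d{4})\.level(\d+)\.(\d+)\.(\d+)\.xml$")


def parse_output_name(name):
    """Return the year and level encoded in an XML filename, or None if it does not match."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        return None
    year, level, part3, part4 = match.groups()
    return {
        "year": int(year),
        "level": int(level),
        "other_parts": [part3, part4],  # meaning not documented on this page
    }


info = parse_output_name("mosaic.2005.level1.1244486113.0000001.xml")
print(info["year"], info["level"])  # 2005 1
```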
Comments
If you have any comments, feedback, questions, etc, please send them to the Library Systems Manager, Dave Pattern (d.c.pattern<at>hud.ac.uk).
This document was last updated on 08/Jun/2009 at 19:40