Archive for the ‘loading’ tag

Behind the Copac record 2: MODS and de-duplication

Written by bethan : without comments

We left the records having been rigorously checked for MARC consistency, and uploaded to the MARC21 database used for the RLUK cataloguing service. Next they are processed again, to be added to Copac.

One of the major differences between Copac and the MARC21 database is that the Copac records are not in MARC21. They’re in MODS XML, which is

an XML schema for a bibliographic element set that may be used for a variety of purposes, and particularly for library applications. It is a derivative of the MARC 21 bibliographic format (MAchine-Readable Cataloging) and as such includes a subset of MARC fields, using language-based tags rather than numeric ones.

Copac records are in MODS rather than MARC because Copac records are freely available for anyone to download, and use as they wish. The records in the MARC21 database are not – they remain the property of the creating library or data provider. We couldn’t offer MARC records on Copac without getting into all sorts of copyright issues. Using MODS also means we have all the interoperability benefits of using an XML format.

Before we add the records to Copac we check local data to ensure we’re making best use of available local holdings details, and converting local location codes correctly. Locations in MARC records will often be in a truncated or coded form, eg ‘MLIB’ for ‘Main Library’. We make sure that these will display in a format that will be meaningful to our users.
Click for larger version
It is also at this point that we do the de-duplication of records for Copac. Now, Copac de-duplication garners very mixed reactions: some users think we aren’t doing enough de-duplication; and occasionally we get told that we’re doing too much! We can’t ever hope to please everyone, but we’re aware that the process isn’t perfect, and we’ll be reviewing and updating deduplication during the reengineering. We will also be exploring FRBR work level deduplication.
As I’ve mentioned in an earlier blog post , we don’t de-duplicate anything published pre-1801. So what do we do for the post-1801 records?

As new records comes in we do a quick and dirty match against the existing records using one or more of ISBN, ISSN, title key and date. This identifies potential matches which go through a range of other exact and partial field matches. The exact procedure will vary depending on the type of material, so journals (for instance) will go through a slightly different process than monographs.

Records that are deemed to be the same are merged and for many fields unique data from each record is indexed. This provides for enhanced access to materials eg. a wider range of subject headings than would be present in any of the original records. The deduplication process can thus result in the creation of a single enhanced record containing holdings details for a range of contributing libraries.

As we create the Copac records we also check for the availability of supplementary content information for each document, derived from BookData. We incorporate this into the Copac record further enhancing record content for both search and display, eg. a table of contents, abstract, reviews.

Because the deduplication process is fully automated it needs to err on the side of caution, otherwise some materials might disappear from view, subsumed into similar but unrelated works. This can mean records that appear to be self-evident duplicates to a searcher may be separated on Copac because of minor differences in the records. Changes made to solve one problem example could result in many other records being mis-consolidated. It’s a tricky balance.

However, there is another issue: the current load and deduplication is a relatively slow process. We have large amounts of data flowing onto the database everyday and restricted time for dealing with updates. Consequently, where a library has being making significant local changes to their data, and we get a very large update (say 50,000 records), then this will be loaded straight onto Copac without going through the deduplication process.

This means that the load will, almost certainly, result in duplicate records. These will disappear gradually as they are pulled together by subsequent data loads, but it is this bypassing of the deduplication procedure – in favour of timeliness, that results in many of the duplicate records visible on Copac. One of the aims of the reengineering is to streamline the dataload process, to avoid this update bottleneck, and improve overall duplicate consolidation levels.

So, that’s the Copac record, from receipt to display. We hope you’ve enjoyed this look behind the Copac records. Anything else you’d like to know about? Tell us in the comments!

Thanks to Shirley Cousins for the explanation of the de-duplication procedures

Written by bethan

February 18th, 2010 at 12:17 pm

Posted in Database

Tagged with , ,

Behind the Copac record

Written by bethan : with 2 comments

We’re going to be talking quite a lot about the Copac reengineering, including the move to FRBRise Copac, and in order for you to have some idea of how this is going to change what we do, you need to know what we do now.  So here’s a brief background on the life of a Copac record.

Records are sent to us by the contributing institutions, usually in MARC exchange format, which looks like this:

An unprocessed MARC exchange file

An unprocessed MARC exchange file

We then run this through programmes created by our wonderful programmers (and about which I know very very little, except that they’re fantastic and save both my eyes and my sanity), which create records that look like this:

A processed MARC file

A processed MARC file

This is much easier on the eye, which is fortunate, as this is the stage where I use the warning file (also generated by the program) to look through and track down any possible errors.  This is mainly only done when loading a new library – once a library has been loaded, we just keep an eye on their updates to identify any changes, or new issues that arise.

For instance, the warning file might say ‘WARNING: LONG NAME IN 100 MAY NOT BE PERSONAL NAME  REC 92765’.  I would then look up that record, and check whether the long name in the 100 is, in fact, a personal name, or if it is a corporate name and needs to be in a 110.

This program has been evolving ever since the start of Copac, and it’s now able to handle most changes with very little need for human intervention.  Therefore, when I see ‘WARNING: 700 ‘1 $aDaille, Jean, 1594-1670.’ CHANGED TO ‘1 $aDaille, Jean,$d1594-1670.’ ’, I know that I don’t need to do anything – that change is correct.
Some warnings do need looking at in more depth.  If I see a warning that says something along the lines of ‘WARNING: NO 245 IN REC 76932.  240 CONVERTED TO 245’, then I will look at the original record and the altered record to see if that change is correct.

At this stage we’ll also check if there are any generic fields being used in a local way, that notes are in the correct notes fields, and that all records have holdings information.  Note that we’re largely not in a position to assess the quality of the data in the fields – purely that the right sort of data is in the right fields.  We wouldn’t, for example, correct typos in author’s names or incorrect publication dates.  As well as the fact that doing so would require making judgements, and make the whole process simply unmanageable, the data on Copac belongs to the contributing libraries, and so they are the ones who would need to make any corrections to the content.  Thus, in general,  the only changes we would make are to the MARC structure (or occasionally to the encoding of special characters), to try to ensure standardised data for record sharing and for building Copac.  The  data content of the fields we leave exactly as they are.

Once we’re satisfied that all this is correct, the data is loaded onto the RLUK shared cataloguing database in MARC21 format, where it is available for use by RLUK members and customers.  Back in the Copac office, it’s time for another round of processing, before the data is loaded onto Copac.  More on that next time!

Written by bethan

January 26th, 2010 at 2:02 pm

Posted in Database

Tagged with , ,

Loading/updating to Copac: how easy do you find it?

Written by bethan : with 2 comments

As we have been using the same processes and documentation to handle the loading and updating of libraries for a while, we decided that it was time to ask for some feedback to ensure that we were making the process as easy as possible for the libraries involved.

We asked 9 of the most recently loaded libraries to respond to a short online survey, asking them about their experience of the load and update process, how useful they found the documentation, and whether they had any suggestions for improvement. We did have to emphasise that we were concerned only with the Copac side of the process; unfortunately we can’t do anything about how easy (or otherwise) libraries find it to extract data from their library management systems, although we do recognise this as a valid concern.

The results were very encouraging! Respondents were asked to rate how easy they found the load und update processes, and the vast majority replied that they found them either ‘easy or ‘very easy’, with only one library anticipating that they would find the update process difficult. Documentation was also considered very good, with one library saying that they found it ‘clear and easy to follow’.

It wasn’t all sunshine and flowers, however, as some libraries did comment that they hadn’t realised how long it would take to get the records loaded onto Copac, or how much time it would take them to extract their data. We realise that we need to do a better job of managing expectations here: while we do try to add catalogues as quickly as possible, it can sometimes take time to complete the process, and perhaps we aren’t clear enough about that.

General comments had the Copac staff blushing, as we were told that ‘support has always been excellent’, and ‘we found the process of having our records loaded easy at our end, and thank Copac staff for their help’. One library said that they were ‘just surprised how simple and straightforward the whole procedure turned out to be.’

Responses were kept anonymous, so we can’t tell who exactly we have to thank for all of this wonderful feedback, but we are very grateful for it all :)

If there are any libraries out there who would like to know more, or comment, please get in touch with us! We’d love to hear from existing members of the Copac community who would like to comment, or from libraries who would like to be a part of the community and would like to know more about the (very easy!) technical processes involved.

Written by bethan

January 22nd, 2009 at 11:30 am

Posted in Feedback

Tagged with , ,