Archive for the ‘Database’ Category

Behind the Copac record 2: MODS and de-duplication

Written by bethan : without comments

We left the records having been rigorously checked for MARC consistency, and uploaded to the MARC21 database used for the RLUK cataloguing service. Next they are processed again, to be added to Copac.

One of the major differences between Copac and the MARC21 database is that the Copac records are not in MARC21. They’re in MODS XML, which is

an XML schema for a bibliographic element set that may be used for a variety of purposes, and particularly for library applications. It is a derivative of the MARC 21 bibliographic format (MAchine-Readable Cataloging) and as such includes a subset of MARC fields, using language-based tags rather than numeric ones.

Copac records are in MODS rather than MARC because Copac records are freely available for anyone to download, and use as they wish. The records in the MARC21 database are not – they remain the property of the creating library or data provider. We couldn’t offer MARC records on Copac without getting into all sorts of copyright issues. Using MODS also means we have all the interoperability benefits of using an XML format.

Before we add the records to Copac we check local data to ensure we’re making best use of available local holdings details, and converting local location codes correctly. Locations in MARC records will often be in a truncated or coded form, eg ‘MLIB’ for ‘Main Library’. We make sure that these will display in a format that will be meaningful to our users.
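
To illustrate the kind of location-code conversion described above, here is a minimal sketch. The lookup table and the function name are hypothetical; only the ‘MLIB’ example comes from this post, and the real conversion tables are not shown here.

```python
# Hypothetical lookup table: in practice each contributing library
# would need its own mapping of local codes to display names.
LOCATION_NAMES = {
    "MLIB": "Main Library",  # the example given in the post
}


def display_location(code: str, table: dict = LOCATION_NAMES) -> str:
    """Return a readable location name, falling back to the raw code."""
    cleaned = code.strip()
    # Codes are compared in upper case so 'mlib' and 'MLIB' are treated alike
    # (an assumption made for this sketch).
    return table.get(cleaned.upper(), cleaned)
```

Falling back to the original code means an unknown location is still displayed rather than silently lost.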
It is also at this point that we do the de-duplication of records for Copac. Now, Copac de-duplication garners very mixed reactions: some users think we aren’t doing enough de-duplication; and occasionally we get told that we’re doing too much! We can’t ever hope to please everyone, but we’re aware that the process isn’t perfect, and we’ll be reviewing and updating deduplication during the reengineering. We will also be exploring FRBR work level deduplication.
As I’ve mentioned in an earlier blog post, we don’t de-duplicate anything published pre-1801. So what do we do for the post-1801 records?

As new records come in we do a quick and dirty match against the existing records using one or more of ISBN, ISSN, title key and date. This identifies potential matches, which then go through a range of other exact and partial field matches. The exact procedure will vary depending on the type of material, so journals (for instance) will go through a slightly different process than monographs.
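
The quick first-pass match might be sketched roughly as follows. This is not the actual Copac code: the record layout (plain dictionaries), the form of the title key, and all function and class names are assumptions made for illustration.

```python
import re
from collections import defaultdict


def title_key(title: str, length: int = 20) -> str:
    """Crude title key (assumed form): lower-case, letters and digits only, truncated."""
    return re.sub(r"[^a-z0-9]", "", title.lower())[:length]


def match_keys(record: dict) -> set:
    """Build the cheap keys used to find potential matches."""
    keys = set()
    if record.get("isbn"):
        keys.add("isbn:" + record["isbn"].replace("-", ""))
    if record.get("issn"):
        keys.add("issn:" + record["issn"].replace("-", ""))
    if record.get("title") and record.get("date"):
        keys.add("td:" + title_key(record["title"]) + ":" + str(record["date"]))
    return keys


class CandidateIndex:
    """Maps each match key to the ids of existing records that share it."""

    def __init__(self):
        self._by_key = defaultdict(set)

    def add(self, rec_id: str, record: dict) -> None:
        for key in match_keys(record):
            self._by_key[key].add(rec_id)

    def candidates(self, record: dict) -> set:
        """Ids of existing records sharing at least one key with the new record."""
        found = set()
        for key in match_keys(record):
            found |= self._by_key.get(key, set())
        return found
```

Anything returned by `candidates` is only a potential match; the more careful exact and partial field comparisons would still have to decide whether the records are really the same.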

Records that are deemed to be the same are merged and for many fields unique data from each record is indexed. This provides for enhanced access to materials eg. a wider range of subject headings than would be present in any of the original records. The deduplication process can thus result in the creation of a single enhanced record containing holdings details for a range of contributing libraries.
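
As a rough illustration of the merge step, the sketch below combines matched records into one consolidated record. The field names (`library`, `location`, `subjects`) are hypothetical, and the real process covers many more fields than subject headings.

```python
def merge_records(records: list) -> dict:
    """Merge records judged to be the same item.

    Subject headings are combined without repeats (first-seen order),
    and every contributing library keeps its own holdings entry.
    """
    merged = {"subjects": [], "holdings": []}
    for rec in records:
        for subject in rec.get("subjects", []):
            if subject not in merged["subjects"]:
                merged["subjects"].append(subject)
        merged["holdings"].append(
            {"library": rec["library"], "location": rec.get("location", "")}
        )
    return merged
```

The merged record ends up with a wider set of subject headings than any single contributor supplied, which is the enhancement described above.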

As we create the Copac records we also check for the availability of supplementary content information for each document, derived from BookData. We incorporate this into the Copac record further enhancing record content for both search and display, eg. a table of contents, abstract, reviews.

Because the deduplication process is fully automated it needs to err on the side of caution, otherwise some materials might disappear from view, subsumed into similar but unrelated works. This can mean records that appear to be self-evident duplicates to a searcher may be separated on Copac because of minor differences in the records. Changes made to solve one problem example could result in many other records being mis-consolidated. It’s a tricky balance.

However, there is another issue: the current load and deduplication is a relatively slow process. We have large amounts of data flowing onto the database every day and restricted time for dealing with updates. Consequently, where a library has been making significant local changes to their data, and we get a very large update (say 50,000 records), then this will be loaded straight onto Copac without going through the deduplication process.

This means that the load will, almost certainly, result in duplicate records. These will disappear gradually as they are pulled together by subsequent data loads, but it is this bypassing of the deduplication procedure – in favour of timeliness – that results in many of the duplicate records visible on Copac. One of the aims of the reengineering is to streamline the dataload process, to avoid this update bottleneck, and improve overall duplicate consolidation levels.

So, that’s the Copac record, from receipt to display. We hope you’ve enjoyed this look behind the Copac records. Anything else you’d like to know about? Tell us in the comments!

Thanks to Shirley Cousins for the explanation of the de-duplication procedures

Written by bethan

February 18th, 2010 at 12:17 pm

Posted in Database


Behind the Copac record

Written by bethan : with 2 comments

We’re going to be talking quite a lot about the Copac reengineering, including the move to FRBRise Copac, and in order for you to have some idea of how this is going to change what we do, you need to know what we do now.  So here’s a brief background on the life of a Copac record.

Records are sent to us by the contributing institutions, usually in MARC exchange format, which looks like this:

An unprocessed MARC exchange file


We then run this through programs created by our wonderful programmers (and about which I know very very little, except that they’re fantastic and save both my eyes and my sanity), which create records that look like this:

A processed MARC file


This is much easier on the eye, which is fortunate, as this is the stage where I use the warning file (also generated by the program) to look through and track down any possible errors.  This is mainly only done when loading a new library – once a library has been loaded, we just keep an eye on their updates to identify any changes, or new issues that arise.

For instance, the warning file might say ‘WARNING: LONG NAME IN 100 MAY NOT BE PERSONAL NAME  REC 92765’.  I would then look up that record, and check whether the long name in the 100 is, in fact, a personal name, or if it is a corporate name and needs to be in a 110.

This program has been evolving ever since the start of Copac, and it’s now able to handle most changes with very little need for human intervention.  Therefore, when I see ‘WARNING: 700 ‘1 $aDaille, Jean, 1594-1670.’ CHANGED TO ‘1 $aDaille, Jean,$d1594-1670.’ ’, I know that I don’t need to do anything – that change is correct.
Some warnings do need looking at in more depth.  If I see a warning that says something along the lines of ‘WARNING: NO 245 IN REC 76932.  240 CONVERTED TO 245’, then I will look at the original record and the altered record to see if that change is correct.
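
To give a feel for the sort of automatic fix behind the 700 warning quoted above, here is a minimal sketch that moves trailing life dates out of $a and into $d. It is not the Copac program itself: the regular expression and function name are assumptions, and it works on the subfield data only, without the indicators.

```python
import re

# Assumed pattern: a name in $a ending with a comma, followed by life dates
# such as "1594-1670." (or an open range such as "1950-").
_DATES_IN_A = re.compile(r"^(?P<name>\$a.*?,)\s+(?P<dates>\d{4}-(?:\d{4})?\.?)$")


def split_dates(field: str) -> str:
    """Move trailing dates from $a into a $d subfield; leave other fields alone."""
    match = _DATES_IN_A.match(field)
    if match is None or "$d" in field:
        return field
    return f"{match.group('name')}$d{match.group('dates')}"
```

Applied to `$aDaille, Jean, 1594-1670.` this gives `$aDaille, Jean,$d1594-1670.`, matching the change reported in the warning. A field that already has a $d, or has no dates, is returned unchanged.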

At this stage we’ll also check if there are any generic fields being used in a local way, that notes are in the correct notes fields, and that all records have holdings information. Note that we’re largely not in a position to assess the quality of the data in the fields – purely that the right sort of data is in the right fields. We wouldn’t, for example, correct typos in authors’ names or incorrect publication dates. As well as the fact that doing so would require making judgements, and make the whole process simply unmanageable, the data on Copac belongs to the contributing libraries, and so they are the ones who would need to make any corrections to the content. Thus, in general, the only changes we would make are to the MARC structure (or occasionally to the encoding of special characters), to try to ensure standardised data for record sharing and for building Copac. The data content of the fields we leave exactly as it is.

Once we’re satisfied that all this is correct, the data is loaded onto the RLUK shared cataloguing database in MARC21 format, where it is available for use by RLUK members and customers.  Back in the Copac office, it’s time for another round of processing, before the data is loaded onto Copac.  More on that next time!

Written by bethan

January 26th, 2010 at 2:02 pm

Posted in Database


Database update

Written by Ashley : without comments

We’ve had a recurrence of the problem I reported a month ago and so last night we installed an update to the database software we use. I’m told the update contains fixes relevant to the problems we have been experiencing, so here’s hoping it brings some increased reliability with it.

Please accept our apologies if you experienced some disruption last night while I was updating the software.

Written by Ashley

November 13th, 2009 at 10:29 am

Posted in Database


Yesterday’s loss of service

Written by Ashley : with one comment

I thought I’d write a note about why we lost the Copac service for a couple of hours yesterday.

The short of it is that our database software hung when it tried to read a corrupted file in which it keeps track of sessions. The result was that everyone’s search process hung, and so frustrated users kept re-trying their searches, which created more hung sessions until the system was full of hung processes with no CPU or memory left. Once we had deleted the corrupted file, everything was okay.

The long version goes something like this… From what I remember, things started going pear-shaped a little before noon when the machine running the service started becoming unresponsive. A quick look at the output of top showed we had far more search sessions running than normal and that the system was almost out of swap space.

It wasn’t clear why this was happening and because the system was running out of swap it was very difficult to diagnose the problem. It was difficult to run programs from the command line as, more often than not, they immediately died with the message “out of memory.” I did manage to shut down the web server in an effort to lighten the load and stop more search sessions being created. It was proving almost impossible to kill off the existing search sessions. In Unix a “kill -9” on a process should immediately stop the process and release its memory back to the system. But yesterday a “kill -9” was having no effect on some processes and those that we did manage to kill were being listed as “defunct” and still seemed to be holding onto memory. In the end we just thought it would be best to re-boot the system and hope that it would solve whatever the problem was.

It took ages for the system to shut itself down — presumably because the shutdown procedures weren’t working with no memory to work in. Anyway, it did finally reboot and within minutes of the system coming up it became overloaded with search sessions and ran out of memory again.

We immediately shut down the web server again. However, search sessions were still being created by people using Z39.50 and so we had to edit the system configuration files to stop inetd spawning more Z39.50 search sessions. Editing inetd.conf didn’t prove to be the trivial task it should have been, but we did get it done eventually. We then tried killing off the 500 or so search sessions that were hogging the system — and that proved difficult too. Many of the processes refused to die. So, after sitting staring at the screen for about 15 minutes, unable to run programs because there was no memory and wondering what on earth do we do now, the system recovered itself. The killed off processes did finally die, memory was released and we could do stuff again!

A bit of investigation showed that the search processes weren’t getting very far into their initialisation procedure before hanging or going into an infinite loop. I used the Solaris truss program to see what files the search process was reading and what system calls it was making. Truss showed that the process was going off into cloud cuckoo land just after reading a file the database software uses to track sessions. So I deleted that file and everything started working again! The file got re-created next time a search process ran — presumably the file had become corrupted.

Written by Ashley

October 15th, 2009 at 1:54 pm

Posted in Database, Interfaces


Persistent identifiers for Copac records

Written by Ashley : with 4 comments

If you know the record number of a Copac record, there is now a simple url that will return you the record in MODS XML format. The urls take the following form: http://copac.ac.uk/crn/<record-number>. For instance, the work “China tide : the revealing story of the Hong Kong exodus to Canada” has a Copac Record Number of 72008715609 and can be linked to with the url http://copac.ac.uk/crn/72008715609.
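
Building one of these links in code is straightforward. The helper below is only a sketch: the function name is made up, and the check that a record number consists solely of digits is an assumption based on the example above.

```python
COPAC_CRN_BASE = "http://copac.ac.uk/crn/"  # URL pattern given in the post


def crn_url(record_number) -> str:
    """Return the persistent URL for a Copac Record Number."""
    number = str(record_number).strip()
    # Assumption: record numbers are purely numeric, like the example in the post.
    if not number.isdigit():
        raise ValueError(f"not a valid record number: {record_number!r}")
    return COPAC_CRN_BASE + number
```

For the example above, `crn_url(72008715609)` gives http://copac.ac.uk/crn/72008715609.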

Over the next few weeks we’ll be looking at adding these links to the Copac Full record pages and also introducing links to bookmarking web sites such as del.icio.us.

Written by Ashley

September 30th, 2008 at 2:09 pm

Institute of Education reload

Written by Ashley : without comments

Last week we started re-loading the Institute of Education Library records. Due to the number of records involved it will take a little while to complete the operation and as of today approximately half of the records are visible in the Copac interfaces. The rest of the records should be available this time next week.

The re-load was required to enable better access to live circulation information from the Institute’s Library Management System.

Written by Ashley

September 29th, 2008 at 11:32 am

Posted in Database


Re-structuring the database

Written by Ashley : with 6 comments

We are thinking of changing the database we use to run Copac. The current database software we use is very good at what it does, which is free text searching, but it is proving problematical in other areas. For instance, it doesn’t know about Unicode or XML, which was okay some years ago when 7-bit ASCII was the norm, record displays were simple and there was less interest in inter-operating with other people and services. We have managed to shoehorn Unicode and XML into the database, though it doesn’t sit there easily and some pre- and/or post-processing is needed on the records.

The current database software doesn’t cope well with the number and size of records we are throwing at it. For instance, the limit on record size is too small and the number of records we have means the database has to be structured in such a way as makes updating slower than we would like. We’d also like something with faster searching.

We haven’t decided what the replacement software is going to be, though we have been thinking about how a new Copac database might be structured…

De-duplication

Some people think we do too much de-duplication of our records, others think we do too little. So, we are thinking of having two levels of de-duplication, one at the FRBR work level and another level of de-duplication broadly based on edition and format. The two levels would be linked in a 1 to n relationship. I.e. a FRBR level record would link to several edition level records. An edition level record would link back to one FRBR level record and also other edition level records which link to the same FRBR record. This would result in a three level hierarchy with the individual library records at the bottom. How this would translate into a user interface is yet to be decided.
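
One way to picture the three level hierarchy is the sketch below. The class and attribute names are invented for illustration, and nothing here reflects a decision about the eventual database design.

```python
from dataclasses import dataclass, field


@dataclass
class LibraryRecord:
    """Bottom level: a record as supplied by one contributing library."""
    library: str
    local_id: str


@dataclass
class EditionRecord:
    """Middle level: de-duplicated on edition and format."""
    label: str
    # Link back to the single FRBR work level record (kept out of repr/compare
    # because the work also points back at its editions).
    work: "WorkRecord" = field(repr=False, compare=False)
    library_records: list = field(default_factory=list)

    def siblings(self) -> list:
        """Other edition level records linked to the same work."""
        return [e for e in self.work.editions if e is not self]


@dataclass
class WorkRecord:
    """Top level: one FRBR work linking to n edition level records."""
    title: str
    editions: list = field(default_factory=list)

    def add_edition(self, label: str) -> EditionRecord:
        edition = EditionRecord(label, self)
        self.editions.append(edition)
        return edition
```

An edition reaches its related editions by going up to the work and back down, which keeps the 1 to n relationship in a single place.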

Holdings statements

We currently store local holdings information in with the main bibliographic record. Doing otherwise in a non-relational database would have been troublesome. The plan is to keep the holdings out of the bibliographic records and only pull them in when they are needed.

Updating

This should enable us to reduce the burden of the vast number of updates we have to perform. For instance, we sometimes receive updates from our contributing libraries of over 100,000 records, and updates of over a quarter of a million records are not unknown. Our larger contributors send updates of around twenty thousand records on a weekly basis. We now have over 50 contributing libraries and that adds up to a lot of records every week that we need to push through the system.

Unfortunately for us, many of these updated records probably only have changes to local data and no changes to the bibliographic data. However, the current system means we have to delete each updated record from the database and then add it back in. If a record was part of a de-duplicated set then that delete and add results in the de-duplicated record being rebuilt twice for probably no overall change to the bibliographic details.

So, the plan for a new system is that when a library updates a record we will immediately update our copy that stores the local data and mark for update the FRBR level and edition level records it is a part of. The updating of these de-duplicated record sets will be done off-line or during the small hours when the systems are less busy. If we can determine that an updated record had no changes to the bibliographic data then there would be no need to update the de-duplicated sets at all.
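
The update plan above could be sketched along these lines. Everything here is hypothetical: the `Store` class, the choice of which fields count as local data, and the use of a hash to detect bibliographic changes are assumptions for illustration, not the planned implementation.

```python
import hashlib
import json


class Store:
    def __init__(self):
        self.local = {}          # record id -> record as supplied by the library
        self.bib_digest = {}     # record id -> fingerprint of the bibliographic data
        self.member_of = {}      # record id -> ids of its work and edition level sets
        self.dirty_sets = set()  # sets waiting for an off-line rebuild


def bib_fingerprint(record: dict, local_fields=("holdings", "location")) -> str:
    """Hash only the bibliographic fields (which fields are local is assumed)."""
    bib = {k: v for k, v in record.items() if k not in local_fields}
    return hashlib.sha256(json.dumps(bib, sort_keys=True).encode("utf-8")).hexdigest()


def apply_update(store: Store, rec_id: str, record: dict) -> bool:
    """Refresh the local copy at once; queue set rebuilds only if bib data changed."""
    new_digest = bib_fingerprint(record)
    changed = store.bib_digest.get(rec_id) != new_digest
    store.local[rec_id] = record
    store.bib_digest[rec_id] = new_digest
    if changed:
        store.dirty_sets.update(store.member_of.get(rec_id, ()))
    return changed
```

An update that only touches holdings leaves `dirty_sets` untouched, so the de-duplicated sets are not rebuilt; a separate off-line job would work through whatever has been marked.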

What now?

We think we know how we are going to do all the above and our next step is to produce a mock-up we can use to test our ideas…

Written by Ashley

August 18th, 2008 at 12:45 pm