Archive for the ‘de-duplication’ tag

Behind the Copac record 2: MODS and de-duplication

Written by bethan : without comments

We left the records having been rigorously checked for MARC consistency, and uploaded to the MARC21 database used for the RLUK cataloguing service. Next they are processed again, to be added to Copac.

One of the major differences between Copac and the MARC21 database is that the Copac records are not in MARC21. They’re in MODS XML, which is

an XML schema for a bibliographic element set that may be used for a variety of purposes, and particularly for library applications. It is a derivative of the MARC 21 bibliographic format (MAchine-Readable Cataloging) and as such includes a subset of MARC fields, using language-based tags rather than numeric ones.

Copac records are in MODS rather than MARC because Copac records are freely available for anyone to download, and use as they wish. The records in the MARC21 database are not – they remain the property of the creating library or data provider. We couldn’t offer MARC records on Copac without getting into all sorts of copyright issues. Using MODS also means we have all the interoperability benefits of using an XML format.

Before we add the records to Copac we check local data to ensure we’re making best use of available local holdings details, and converting local location codes correctly. Locations in MARC records will often be in a truncated or coded form, eg ‘MLIB’ for ‘Main Library’. We make sure that these will display in a format that will be meaningful to our users.
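To illustrate the location conversion, here is a minimal sketch in Python. It is not the actual Copac code: the lookup table and the function name display_location are hypothetical, and only the 'MLIB' example comes from the text above.

```python
# Hypothetical lookup table; real codes differ for each contributing library.
# Only the "MLIB" entry is taken from the post, "SCI" is invented.
LOCATION_NAMES = {
    "MLIB": "Main Library",
    "SCI": "Science Library",
}


def display_location(code: str) -> str:
    """Return a readable location name, falling back to the tidied raw code."""
    cleaned = code.strip().upper()
    return LOCATION_NAMES.get(cleaned, cleaned)
```

Falling back to the raw code means an unrecognised location is still shown rather than silently dropped.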
It is also at this point that we do the de-duplication of records for Copac. Now, Copac de-duplication garners very mixed reactions: some users think we aren’t doing enough de-duplication; and occasionally we get told that we’re doing too much! We can’t ever hope to please everyone, but we’re aware that the process isn’t perfect, and we’ll be reviewing and updating deduplication during the reengineering. We will also be exploring FRBR work level deduplication.
As I’ve mentioned in an earlier blog post, we don’t de-duplicate anything published pre-1801. So what do we do for the post-1801 records?

As new records come in we do a quick and dirty match against the existing records using one or more of ISBN, ISSN, title key and date. This identifies potential matches, which then go through a range of other exact and partial field matches. The exact procedure will vary depending on the type of material, so journals (for instance) will go through a slightly different process from monographs.
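The first, quick and dirty stage could be sketched roughly as follows. This is an illustration only, assuming records are simple Python dictionaries. The field names, the title_key rule (first four words, lower-cased, punctuation removed) and the CandidateIndex class are all assumptions, not the real Copac matching rules.

```python
import re
from collections import defaultdict


def title_key(title: str) -> str:
    """Lower-case the title, drop punctuation, keep the first four words."""
    words = re.sub(r"[^a-z0-9 ]", " ", title.lower()).split()
    return " ".join(words[:4])


def match_keys(record: dict) -> set:
    """Build the cheap keys used to find potential duplicates."""
    keys = set()
    if record.get("isbn"):
        keys.add(("isbn", record["isbn"].replace("-", "")))
    if record.get("issn"):
        keys.add(("issn", record["issn"].replace("-", "")))
    if record.get("title") and record.get("date"):
        keys.add(("title+date", title_key(record["title"]), record["date"]))
    return keys


class CandidateIndex:
    """Maps each key to the ids of existing records that share it."""

    def __init__(self):
        self._by_key = defaultdict(set)

    def add(self, record_id: str, record: dict) -> None:
        for key in match_keys(record):
            self._by_key[key].add(record_id)

    def candidates(self, record: dict) -> set:
        """Return ids of existing records sharing at least one key."""
        found = set()
        for key in match_keys(record):
            found |= self._by_key.get(key, set())
        return found
```

Anything returned by candidates() is only a potential match; the slower exact and partial field comparisons would still have to confirm it.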

Records that are deemed to be the same are merged and for many fields unique data from each record is indexed. This provides for enhanced access to materials eg. a wider range of subject headings than would be present in any of the original records. The deduplication process can thus result in the creation of a single enhanced record containing holdings details for a range of contributing libraries.
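As a rough sketch of what merging means here, the example below combines records that have already been judged to be duplicates. The dictionary layout and the merge_records function are hypothetical; the real process works on many more fields.

```python
def merge_records(records: list) -> dict:
    """Merge records already judged to be duplicates into one record.

    Subject headings are pooled without repeats, and each contributing
    library keeps its own holdings entry.
    """
    merged = {"title": records[0]["title"], "subjects": [], "holdings": []}
    for record in records:
        for subject in record.get("subjects", []):
            if subject not in merged["subjects"]:
                merged["subjects"].append(subject)
        merged["holdings"].append(
            {"library": record["library"], "location": record.get("location", "")}
        )
    return merged
```

The merged record ends up with a wider set of subject headings than any single contributor supplied, while the holdings list still shows who holds a copy.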

As we create the Copac records we also check for the availability of supplementary content information for each document, derived from BookData. We incorporate this into the Copac record, further enhancing record content for both search and display, eg. a table of contents, abstract, reviews.

Because the deduplication process is fully automated it needs to err on the side of caution, otherwise some materials might disappear from view, subsumed into similar but unrelated works. This can mean records that appear to be self-evident duplicates to a searcher may be separated on Copac because of minor differences in the records. Changes made to solve one problem example could result in many other records being mis-consolidated. It’s a tricky balance.

However, there is another issue: the current load and deduplication is a relatively slow process. We have large amounts of data flowing onto the database every day and restricted time for dealing with updates. Consequently, where a library has been making significant local changes to their data, and we get a very large update (say 50,000 records), then this will be loaded straight onto Copac without going through the deduplication process.

This means that the load will, almost certainly, result in duplicate records. These will disappear gradually as they are pulled together by subsequent data loads, but it is this bypassing of the deduplication procedure, in favour of timeliness, that results in many of the duplicate records visible on Copac. One of the aims of the reengineering is to streamline the dataload process, to avoid this update bottleneck, and improve overall duplicate consolidation levels.

So, that’s the Copac record, from receipt to display. We hope you’ve enjoyed this look behind the Copac records. Anything else you’d like to know about? Tell us in the comments!

Thanks to Shirley Cousins for the explanation of the de-duplication procedures

Written by bethan

February 18th, 2010 at 12:17 pm

Posted in Database


Re-structuring the database

Written by Ashley : with 6 comments

We are thinking of changing the database we use to run Copac. The current database software we use is very good at what it does, which is free text searching, but it is proving problematical in other areas. For instance, it doesn’t know about Unicode or XML, which was okay some years ago when 7-bit ASCII was the norm, record displays were simple and there was less interest in inter-operating with other people and services. We have managed to shoehorn Unicode and XML into the database, though it doesn’t sit there easily and some pre- and/or post-processing is needed on the records.

The current database software doesn’t cope well with the number and size of records we are throwing at it. For instance, the limit on record size is too small and the number of records we have means the database has to be structured in a way that makes updating slower than we would like. We’d also like something with faster searching.

We haven’t decided what the replacement software is going to be, though we have been thinking about how a new Copac database might be structured…

De-duplication

Some people think we do too much de-duplication of our records, others think we do too little. So, we are thinking of having two levels of de-duplication, one at the FRBR work level and another level of de-duplication broadly based on edition and format. The two levels would be linked in a 1 to n relationship, i.e. a FRBR level record would link to several edition level records. An edition level record would link back to one FRBR level record and also to the other edition level records which link to the same FRBR record. This would result in a three level hierarchy with the individual library records at the bottom. How this would translate into a user interface is yet to be decided.
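One possible way to picture the three levels is sketched below in Python. The class and field names (WorkRecord, EditionRecord, LibraryRecord) are illustrative assumptions only; no data model has been decided.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class LibraryRecord:
    """Bottom level: one contributing library's own record."""
    library: str
    record_id: str


@dataclass(eq=False)
class EditionRecord:
    """Middle level: de-duplicated on edition and format."""
    label: str
    work: Optional["WorkRecord"] = None
    library_records: List[LibraryRecord] = field(default_factory=list)

    def siblings(self) -> List["EditionRecord"]:
        """Other edition level records linked to the same work."""
        if self.work is None:
            return []
        return [e for e in self.work.editions if e is not self]


@dataclass(eq=False)
class WorkRecord:
    """Top level: one FRBR work linking to n edition level records."""
    title: str
    editions: List[EditionRecord] = field(default_factory=list)

    def add_edition(self, edition: EditionRecord) -> None:
        edition.work = self  # link back from the edition to its work
        self.editions.append(edition)
```

In this sketch the sideways links between editions are not stored; they are derived by going up to the work and back down, which keeps the 1 to n relationship as the single source of truth.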

Holdings statements

We currently store local holdings information in with the main bibliographic record. Doing otherwise in a non-relational database would have been troublesome. The plan is to keep the holdings out of the bibliographic records and only pull them in when they are needed.

Updating

This should enable us to reduce the burden of the vast number of updates we have to perform. For instance, we sometimes receive updates from our contributing libraries of over 100,000 records, and updates of over a quarter of a million records are not unknown. Our larger contributors send updates of around twenty thousand records on a weekly basis. We now have over 50 contributing libraries and that adds up to a lot of records every week that we need to push through the system.

Unfortunately for us, many of these updated records probably only have changes to local data and no changes to the bibliographic data. However, the current system means we have to delete each record from the database and then add it back in. If a record was part of a de-duplicated set then that delete and add results in the de-duplicated record being rebuilt twice for probably no overall change to the bibliographic details.

So, the plan for a new system is that when a library updates a record we will immediately update our copy that stores the local data and mark for update the FRBR level and edition level records it is a part of. The updating of these de-duplicated record sets will be done off-line or during the small hours when the systems are less busy. If we can determine that an updated record had no changes to the bibliographic data then there would be no need to update the de-duplicated sets at all.
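A minimal sketch of this mark-and-rebuild idea is given below. The UpdateQueue class, its method names and the way records and sets are identified are all assumptions made for illustration; the real design may look quite different.

```python
class UpdateQueue:
    """Apply local changes at once and defer rebuilding de-duplicated sets."""

    def __init__(self):
        self.local_data = {}     # record id -> holdings / local data
        self.bib_data = {}       # record id -> bibliographic fields
        self.dirty_sets = set()  # ids of work/edition sets awaiting rebuild

    def apply_update(self, record_id, set_ids, bib, holdings):
        """Store the update; mark sets only if the bibliographic data changed."""
        self.local_data[record_id] = holdings
        if self.bib_data.get(record_id) != bib:
            self.bib_data[record_id] = bib
            self.dirty_sets.update(set_ids)

    def rebuild_pending(self, rebuild):
        """Run off-line: rebuild each marked set once, then clear the marks."""
        pending = sorted(self.dirty_sets)
        for set_id in pending:
            rebuild(set_id)
        self.dirty_sets.clear()
        return pending
```

Because the marks are held in a set, a de-duplicated record touched by many updates in one day is still rebuilt only once, and an update that changes only local data marks nothing.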

What now?

We think we know how we are going to do all the above and our next step is to produce a mock-up we can use to test our ideas…

Written by Ashley

August 18th, 2008 at 12:45 pm