Copac deduplication

Over 60 institutions contribute records to the Copac database. We try to de-duplicate those contributions so that records from multiple contributors for the same item are “consolidated” together into a single Copac record. Our de-duplication efforts have reduced over 75 million records down to 40 million.

Our contributors send us updates on a regular basis which results in a large amount of database “churn.” Approximately one million records a month are altered as part of the updating process.

Updating a consolidated record

Updating a database like Copac is not as immediately intuitive as you may think. A contributor sending us a new record may result in us deleting a Copac record. A contributor who deletes a record may result in a Copac record being created. A diagram may help explain this.

A Copac consolidated record created from 5 contributed records. Lines show how contributed records match with one another.

The above graph represents a single Copac record consolidated from five contributed records: a1, a2, a3, b1 & b2. A line between two records indicates that our record matching algorithm thinks the records are for the same bibliographic item. Hence, record a1,a2 & a3 match with one another; b1 & b2 match with each other and a1 matches with b1.

Should record b1 be deleted from the database, then as b2 does not match with any of a1, a2 or a3 we are left with two clumps of records. Records a1, a2 & a3 would form one consolidated record and b2 would constitute a Copac record in its own right as it matches with no other record. Hence the deletion of a contributed record turns one Copac record into two Copac records.

I hope it is clear that the inverse can happen — that a new contributed record can bring together multiple Copac records into a single Copac record.

The above is what would happen in an ideal world. Unfortunately the current Copac database does not save a log of the record matches it has made and neither does it attempt to re-match the remaining records of a consolidated set when a record is deleted. The result is that when record b1 is deleted, record b2 will stay attached to records a1, a2 & a3. Coupled with the high amount of database churn this can sometimes result in seemingly mis-consolidated records.

Smarter updates

As part of our forthcoming improvements to Copac  we are keeping a log of records that match. This makes it easier for the Copac update procedures to correctly disentangle a consolidated record and should result in less mis-consolidations.

We are also trying to make the update procedures smarter and have them do less. For historical reasons the current Copac database is really two databases: a database of the contributors records and a database of consolidated records. The contributors database is updated first and a set of deletions and additions/updates is passed onto the consolidated database. The consolidated database doesn’t know if an updated record has changed in a trivial way or now represents another item completely. It therefore has no choice but to re-consolidate the record and that means deleting it from the database and then adding it back in (there is no update functionality.) This is highly inefficient.

The new scheme of things tries to be a bit more intelligent. An updated record from a contributor is compared with the old version of itself and categorised as follows:

  • The main bibliographic details are unchanged and only the holdings information is different.
  • The bibliographic record has changed, but not in a way that would affect the way it has matched with other records.
  • The bibliographic record has changed significantly.

Only in the last case does the updated record need to be re-consolidated (and in future that will be done without having to delete the record first!) In the first two cases we would only need to refresh the record that we use to create our displays.

 

An analysis of an update from one of our contributors showed that it contained 3818 updated records; 954 had unchanged bibliographic details and only 155 had changed significantly and needed reconsolidating. The saving there is quite big. In the current Copac database we have to re-consolidate 3818 records. In the new version of Copac we only need to re-consolidate 155. This will reduce database churn significantly, result in updates being applied faster and allow us to have more contributors.

Example Consolidations

Just for interest and because I like the graphs, I’ve included a couple graphs of consolidated records from our test database. The first graph shows a larger set of records. There are two records in this set that when either are deleted would result in the set being broken up into two smaller sets.

The graph below shows a smaller set of records where each record matches with every other record.