Copac Beta Interface
Written by Ashley : without comments
We’ve just released the beta test version of a new Copac interface and I thought I’d write a few notes about it and how we’ve created it.
Some of the more significant changes to the search result page (or “brief display” as we call it) are:
- There are now links to the library holdings information pages directly from the brief display. You no longer have to go via the “full record” page to get to the holdings information.
- You can see a more complete view of a record by clicking on the magnifying glass icon at the end of the title. This enables you to quickly view a more detailed record without having to leave the brief display.

- You can quickly edit your query terms using the search forms at the top of the page.
- To further refine your search you can add keywords to the query by typing them into the “Search within results” box.
- You can change the number of records displayed in the result page.
The pages have been designed using Responsive Web Design techniques — which is jargon that means that the HTML5 and CSS have been designed in such a way that the web page rearranges itself depending on the size of your screen. The new interface should work whether you are using a desktop with a cinema display, a tablet computer or a mobile phone. Users of those three display types will see a different arrangement of screen elements and some may be missing altogether on the smaller displays. If you use a tablet computer or smartphone, then please give beta a try on them and let us know what you think.
The CGI script that creates the web pages is a C++ application which outputs some fairly simple, custom XML. The XML is fed through an XSLT stylesheet to produce the HTML (and also the various record export formats). Opinion on the web seems divided on whether or not this is a good idea; the most valid complaint seems to be that it is slow. It seems fast enough to us and the beta way of doing things is actually an improvement as there is now just one XSLT stylesheet used in creating the display, whereas our old way of doing things used multiple XSLT stylesheets run multiple times for each web page. Which probably just goes to show that the most significant eater of time is the searching of the database rather than the creation of the HTML.
Copac deduplication
Written by Ashley : without comments
Over 60 institutions contribute records to the Copac database. We try to de-duplicate those contributions so that records from multiple contributors for the same item are “consolidated” together into a single Copac record. Our de-duplication efforts have reduced over 75 million records down to 40 million.
Our contributors send us updates on a regular basis which results in a large amount of database “churn.” Approximately one million records a month are altered as part of the updating process.
Updating a consolidated record
Updating a database like Copac is not as immediately intuitive as you may think. A contributor sending us a new record may result in us deleting a Copac record. A contributor who deletes a record may result in a Copac record being created. A diagram may help explain this.

A Copac consolidated record created from 5 contributed records. Lines show how contributed records match with one another.
The above graph represents a single Copac record consolidated from five contributed records: a1, a2, a3, b1 & b2. A line between two records indicates that our record matching algorithm thinks the records are for the same bibliographic item. Hence, records a1, a2 & a3 match with one another; b1 & b2 match with each other and a1 matches with b1.
Should record b1 be deleted from the database, then as b2 does not match with any of a1, a2 or a3 we are left with two clumps of records. Records a1, a2 & a3 would form one consolidated record and b2 would constitute a Copac record in its own right as it matches with no other record. Hence the deletion of a contributed record turns one Copac record into two Copac records.
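The splitting described above can be sketched in a few lines of Python by treating each consolidated record as a connected component of the match graph. This is only an illustration, not how Copac is implemented: the function name `consolidated_sets` is made up for the example, and the match pairs are simply the lines described for the diagram.

```python
from collections import defaultdict

def consolidated_sets(records, matches):
    """Group records into consolidated sets (connected components of the match graph)."""
    neighbours = defaultdict(set)
    for a, b in matches:
        # Ignore matches that involve a record no longer in the database.
        if a in records and b in records:
            neighbours[a].add(b)
            neighbours[b].add(a)
    seen = set()
    groups = []
    for start in sorted(records):
        if start in seen:
            continue
        # Walk outwards from this record, following match lines.
        group, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups

records = {"a1", "a2", "a3", "b1", "b2"}
matches = [("a1", "a2"), ("a1", "a3"), ("a2", "a3"), ("b1", "b2"), ("a1", "b1")]

before = consolidated_sets(records, matches)
after = consolidated_sets(records - {"b1"}, matches)
print(before)  # one set containing all five records
print(after)   # two sets: a1, a2, a3 together and b2 on its own
```

Running the same function after adding a record back in shows the inverse case: a new record with the right matches joins two sets into one.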
I hope it is clear that the inverse can happen — that a new contributed record can bring together multiple Copac records into a single Copac record.
The above is what would happen in an ideal world. Unfortunately the current Copac database does not save a log of the record matches it has made and neither does it attempt to re-match the remaining records of a consolidated set when a record is deleted. The result is that when record b1 is deleted, record b2 will stay attached to records a1, a2 & a3. Coupled with the high amount of database churn this can sometimes result in seemingly mis-consolidated records.
Smarter updates
As part of our forthcoming improvements to Copac we are keeping a log of records that match. This makes it easier for the Copac update procedures to correctly disentangle a consolidated record and should result in fewer mis-consolidations.
We are also trying to make the update procedures smarter and have them do less. For historical reasons the current Copac database is really two databases: a database of the contributors’ records and a database of consolidated records. The contributors’ database is updated first and a set of deletions and additions/updates is passed on to the consolidated database. The consolidated database doesn’t know if an updated record has changed in a trivial way or now represents another item completely. It therefore has no choice but to re-consolidate the record and that means deleting it from the database and then adding it back in (there is no update functionality). This is highly inefficient.
The new scheme of things tries to be a bit more intelligent. An updated record from a contributor is compared with the old version of itself and categorised as follows:
- The main bibliographic details are unchanged and only the holdings information is different.
- The bibliographic record has changed, but not in a way that would affect the way it has matched with other records.
- The bibliographic record has changed significantly.
Only in the last case does the updated record need to be re-consolidated (and in future that will be done without having to delete the record first!) In the first two cases we would only need to refresh the record that we use to create our displays.
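The three categories above could be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual Copac code: records are plain dictionaries, holdings live under a `holdings` key, and `MATCH_FIELDS` is a hypothetical stand-in for whichever fields the real matching algorithm looks at.

```python
from enum import Enum

class Change(Enum):
    HOLDINGS_ONLY = 1     # bibliographic details unchanged
    BIB_MINOR = 2         # changed, but not in a way that affects matching
    BIB_SIGNIFICANT = 3   # changed in a way that could affect matching

# Hypothetical: the fields assumed to take part in record matching.
MATCH_FIELDS = ("isbn", "title", "date")

def classify_update(old, new):
    """Compare an updated record with its previous version."""
    old_bib = {k: v for k, v in old.items() if k != "holdings"}
    new_bib = {k: v for k, v in new.items() if k != "holdings"}
    if old_bib == new_bib:
        # Also covers a record that has not changed at all.
        return Change.HOLDINGS_ONLY
    if all(old.get(f) == new.get(f) for f in MATCH_FIELDS):
        return Change.BIB_MINOR
    return Change.BIB_SIGNIFICANT

def needs_reconsolidation(old, new):
    """Only a significant bibliographic change requires re-consolidation."""
    return classify_update(old, new) is Change.BIB_SIGNIFICANT

# Made-up sample records for illustration.
old = {"isbn": "X1", "title": "An example", "date": "2001", "note": "", "holdings": ["Main"]}
moved = dict(old, holdings=["Main", "Store"])
annotated = dict(old, note="Includes index")
retitled = dict(old, title="A different book")

print(classify_update(old, moved))
print(classify_update(old, annotated))
print(classify_update(old, retitled))
```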
An analysis of an update from one of our contributors showed that it contained 3818 updated records; 954 had unchanged bibliographic details and only 155 had changed significantly and needed reconsolidating. The saving there is quite big. In the current Copac database we have to re-consolidate 3818 records. In the new version of Copac we only need to re-consolidate 155. This will reduce database churn significantly, result in updates being applied faster and allow us to have more contributors.
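To put rough numbers on that saving, here is the arithmetic for the update analysed above. The only assumption is that the three categories cover every updated record, so the middle category is whatever is left over.

```python
updated = 3818        # updated records in the contributor's update
holdings_only = 954   # bibliographic details unchanged
significant = 155     # changed significantly, need re-consolidating

# Assumption: the remaining records fall into the middle category
# (changed, but not in a way that affects matching).
minor = updated - holdings_only - significant

share_reconsolidated = significant / updated
avoided = updated - significant

print(f"Middle category: {minor} records")
print(f"Re-consolidated under the new scheme: {share_reconsolidated:.1%}")
print(f"Re-consolidations avoided: {avoided} ({avoided / updated:.1%})")
```

So only about 4% of the records in that update would need re-consolidating.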
Example Consolidations
Just for interest and because I like the graphs, I’ve included a couple of graphs of consolidated records from our test database. The first graph shows a larger set of records. There are two records in this set that, if either were deleted, would result in the set being broken up into two smaller sets.
The graph below shows a smaller set of records where each record matches with every other record.
Performance improvements
Written by Ashley : without comments
The run up to Christmas (or Autumn term if you prefer) is always our busiest time of year as measured by the number of searches performed by our users. Last year the search response times were not what we would have liked and we have been investigating the causes of the poor performance and ways of improving it. Our IT people determined that at our busiest times the disk drives in our SAN were being pushed to their maximum performance and just couldn’t deliver data any faster. So, over the summer we have installed an array of Solid State Disks to act as a fast cache for our file-systems (for the more technical, I believe it is actually configured as a ZFS Level 2 cache, or L2ARC).
The SSD cache was turned on during our brief downtime on Thursday morning and so far the results look promising. I’m told the cache is still “warming up” and that performance may improve still further. The best performance indicator I can provide is the graph below. We run a “standard” query against the database every 30 minutes and record the time taken to run the query. The graph below plots the time (in seconds) to run the query since midnight on the 23rd August 2011. I think it is pretty obvious from looking at the graph exactly when the SSD cache was configured in.
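For anyone wanting to collect a similar measurement, a minimal sketch of timing a “standard” query might look like the following. It is not our monitoring script: `standard_query` is a placeholder workload standing in for a real database search, and running it every 30 minutes would be left to something like cron.

```python
import time
from datetime import datetime, timezone

def time_query(run_query):
    """Return the wall-clock seconds taken by one call to run_query."""
    start = time.perf_counter()
    run_query()
    return time.perf_counter() - start

def standard_query():
    # Placeholder workload; a real check would run a fixed search against the database.
    return sum(range(100_000))

def take_sample(run_query=standard_query):
    """Return (UTC timestamp, elapsed seconds) for one run of the query."""
    return datetime.now(timezone.utc).isoformat(), time_query(run_query)

timestamp, seconds = take_sample()
print(f"{timestamp}\t{seconds:.3f}")
```

Appending each line to a log file gives a series that can be plotted against time.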
It all looks very promising so far and I think we can look forward to the Autumn with less trepidation and hopefully some happier users.
Getting Excited about Collection Management
Written by joy : without comments
The Copac Collections Management Tools Project is a collaboration between Mimas, RLUK, and the White Rose Consortium.
A number of partners have been working through and with us here at Mimas on a JISC funded Collection Management project, which is part of the broader Resource Discovery Taskforce activity.
Since we have all been working on this slightly under the radar, and recognising the need to share more about this project and what’s going on, we’re planning a series of blog posts to update the community on the progress and lessons learned through the partnership. The following update is from Julia Chruszcz, who is project managing this piece of work:
Just two months into the JISC funded Copac Collection Management Project the progress has been significant. At a meeting of the project partners on the 6th May each of the representatives from the White Rose Consortium (WRC) universities (Leeds, York and Sheffield) articulated the potential significance of this tool on their decision making processes around monograph retention and disposal and collection development. This included notions of collaborative collection development and how such a Collection Management Tool could facilitate regional and national approaches, each influencing local decisions for libraries.
The WRC has undertaken the early testing of the web-based tool in an approach that the project has adopted to inform development and iteratively assess the tool. The idea is to build up a full specification over the life of the project of what will be required to take such a tool forward to introduce into library workflows. The next stage, between now and the beginning of July, will be to further develop the batch and web technical interfaces based upon the WRC feedback and for this development to undergo further critical testing. The project is due to provide an interim report at the end of June with a full report to the JISC at the end of July.
The enthusiasm from all the project partners, JISC, Mimas, RLUK and WRC, stems from the realisation that we have the potential to produce a tool that will make a real difference to helping libraries make informed decisions particularly at a time of financial constraint, and assist in furthering the possibility of a national monographs collection, protecting access for researchers at the same time as facilitating local decisions that will save money and resource longer term. And all this by intelligent re-use and application of an existing extensive database, a resource invested in by RLUK and the JISC over many years, the Copac database.
If this is something you are interested in we’d really like to hear your view point and perspective.
Surfacing the Academic Long Tail — Announcing new work with activity data
Written by joy : without comments
We’re pleased to announce that JISC has funded us to work on the SALT (Surfacing the Academic Long Tail) Project, which we’re undertaking with the University of Manchester, John Rylands University Library.
Over the next six months the SALT project will be building a recommender prototype for Copac and the JRUL OPAC interface, which will be tested by the communities of users of those services. Following on from the invaluable work undertaken at the University of Huddersfield, we’ll be working with ten years+ of aggregated and anonymised circulation data amassed by JRUL. Our approach will be to develop an API onto that data, which in turn we’ll use to develop the recommender functionality in both services. Obviously, we’re indebted to the previous knowledge acquired by a similar project at the University of Huddersfield and the SALT project will work closely with colleagues at Huddersfield (Dave Pattern and Graham Stone) to see what happens when we apply this concept in the research library and national library service contexts.
Our overall aim is that by working collaboratively with other institutions and Research Libraries UK, the SALT project will advance our knowledge and understanding of how best to support research in the 21st century. Libraries are a rich source of valuable information, but sometimes the sheer volume of materials they hold can be overwhelming even to the most experienced researcher, and we know that researchers’ expectations of how to discover content are shifting in an increasingly personalised digital world. We know that library users, particularly those researching niche or specialist subjects, are often seeking content based on a recommendation from a contemporary, a peer, colleagues or academic tutors. The SALT Project aims to provide libraries with the ability to provide users with that information. Similar to Amazon’s ‘customers who bought this item also bought…’, the recommenders on this system will appear on a local library catalogue and on Copac and will be based on circulation data which has been gathered over the past 10 years at The University of Manchester’s internationally renowned research library.
How effective will this model prove to be for users, particularly humanities researchers?
Here’s what we want to find out:
- Will researchers in the field of humanities benefit from receiving book recommendations, and if so, in what ways?
- Will the users go beyond the reading list and be exposed to rare and niche collections — will new paths of discovery be opened up?
- Will collections in the library, previously undervalued and underused find a new appreciative audience — will the Long Tail be exposed and exploited for research?
- Will researchers see new links in their studies, possibly in other disciplines?
We also want to consider if there are other potential beneficiaries. By highlighting rarer collections, valuing niche items and bringing to the surface less popular but nevertheless worthy materials, libraries will have the leverage they need to ensure the preservation of these rich materials. Can such data or services assist in decision-making around collections management? We will be consulting with Leeds University Library and the White Rose Consortium, as well as UKRR in this area.
And finally, as part of our sustainability planning, we want to look at how scalable this approach might be for developing a shared aggregation service of circulation data for UK University Libraries. We’re working with potential data contributors such as Cambridge University Library, University of Sussex Library, and the M25 consortium as well as RLUK to trial and provide feedback on the project outputs, with specific attention to the sustainability of an API service as a national shared service for HE/FE that supports academic excellence and drives institutional efficiencies.
Auto-complete considered harmful?
Written by admin : without comments
Behind the scenes we’ve been creating new versions of Copac that use relational database technology (the current version of Copac doesn’t use a relational database). It’s a big change which has kept me busy for a long time now. One of the things we thought it would be nice to do with all this structured data is to have fields on our web search forms offer suggestions (or auto-complete) as the user types.
It turned out that implementing auto-complete was very easy thanks to jQuery UI. Below is a screen shot (from my test interface) showing the suggestions that auto-complete offers after typing “sha” in the author field.
The suggestions are ordered by how frequently the name appears in the database. So in the screen shot above, “Shakespeare, William, 1564-1616” is the most frequently occurring name starting with the letters “sha” in my test database.
(By the way, these example screen shots are from a test database of about 5 million records selected in a very un-random way from seven of our contributing libraries.)
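The frequency ordering described above is simple to sketch. The example below is only an illustration in Python, not the server-side code behind the test interface: the function name `suggest` is made up, and the sample names and counts are invented rather than taken from the database.

```python
from collections import Counter

def suggest(names, prefix, limit=10):
    """Return names starting with prefix, most frequent first."""
    counts = Counter(names)
    prefix = prefix.lower()
    hits = [(name, n) for name, n in counts.items() if name.lower().startswith(prefix)]
    # Sort by descending frequency, then alphabetically to break ties.
    hits.sort(key=lambda item: (-item[1], item[0]))
    return [name for name, _ in hits[:limit]]

# Invented sample: one entry per record the name appears on.
author_names = (
    ["Shakespeare, William, 1564-1616"] * 3
    + ["Shaw, Bernard"] * 2
    + ["Sharma, A."]
    + ["Smith, John"] * 5
)

print(suggest(author_names, "sha"))
```

In a real database the counting would be done once, in advance, rather than on every keystroke.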
Having done the Author auto-complete I started thinking about how we would present suggestions for a Title auto-complete popup. It didn’t seem useful to present the user with an alphabetical list of titles, neither did it seem much more useful to present the most commonly occurring titles. I thought we could relatively easily log which records users view and then present the suggestions ranked according to how often a title has been viewed.
Then I thought that if a user has already selected an author from the Author auto-complete suggestions, it only makes sense to suggest titles that are by the selected author. For example, a user has selected Shakespeare from the author auto-complete suggestions. They then type “lo” in the title field. It would be pointless and counter-intuitive to list “Lord of the Rings” in the title suggestions; what we should show is “Love’s Labour’s Lost”. But then, by the time you’ve created that list of suggestions for the user you’ve pretty much done their search for them already. So why not just show them the search results straight away? Google are doing this now with their Instant search results. Well as hip and sexy as that sounds I don’t think we can go there. For a start I don’t think we have the compute horsepower to make it as instant as Google do and there are fundamental data problems which make it very hard for us to do well.
So, going back to the Author auto-suggestions, let’s look at what happens when I type “tol” in the author field:
Again, the author suggestions look very nice, but unfortunately the list contains Leo Tolstoy twice: at the top of the list as “Tolstoy, Leo, graf, 1828-1910” and at the bottom of the list as “Tolstoy, Leo”. That’s because there’s no consistent Authority Control across our ~60 contributing libraries (and then there are all the typos to consider).
There are two ways we can turn a user selection from an auto-complete list into a search:
1. We can turn the author name into a keyword search.
2. Each of those names in the list has a unique database ID and we can search for records that have that author-ID.
If we do 2.) then selecting one form of the name Leo Tolstoy will only find records with that exact form and won’t find records that have the second (or third or fourth) form of the name. This will give the search a lot of precision but the recall is likely to be terrible.
If we do 1.) then the top ranking “Tolstoy, Leo, graf, 1828-1910” will only find a subset of our Tolstoy records. As there is a substantial set of records that don’t include “graf, 1828-1910”, a keyword search including those terms will miss those records entirely. If the user selected “Tolstoy, Leo” from the list they will likely find all the Leo Tolstoy records in the database (except those catalogued as “Tolstoy, L.” and those records with typos). The user may wonder why the name variant that finds most records is listed 10th, while the name listed first finds only a subset.
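A toy example makes the trade-off between the two approaches easier to see. Everything here is invented for illustration: the four records, the author IDs and the function names are assumptions, and the keyword search is a simple "all terms must be present" match rather than our real search engine.

```python
import re

# Invented records: three forms of the same author's name, each with its own ID.
records = [
    {"id": 1, "author_id": 101, "author": "Tolstoy, Leo, graf, 1828-1910"},
    {"id": 2, "author_id": 101, "author": "Tolstoy, Leo, graf, 1828-1910"},
    {"id": 3, "author_id": 102, "author": "Tolstoy, Leo"},
    {"id": 4, "author_id": 103, "author": "Tolstoy, L."},
]

def keywords(text):
    """Split a name into lower-case keyword terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def search_by_author_id(author_id):
    """Find records carrying exactly this form of the name."""
    return [r["id"] for r in records if r["author_id"] == author_id]

def search_by_keywords(name):
    """Find records whose author field contains every keyword in the name."""
    wanted = keywords(name)
    return [r["id"] for r in records if wanted <= keywords(r["author"])]

print(search_by_author_id(101))                             # exact form only
print(search_by_keywords("Tolstoy, Leo, graf, 1828-1910"))  # the longer form finds fewer records
print(search_by_keywords("Tolstoy, Leo"))                   # the shorter form finds more
```

In this toy data neither approach finds the record catalogued as “Tolstoy, L.”, which is the recall problem in miniature.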
Maybe we could get around these problems by only using the MARC $a subfield from the 100 and 700 tags. (The examples above are using 100 $a$b$c$d.) Doing that would remove all the additions to names such as “Sir” and the dates. That would probably be okay for authors with distinctive names, but could merge lots of authors with common names. It would reduce search precision and increase recall.
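As a rough sketch of that idea, the snippet below builds a name key from selected subfields of a name field. It assumes each field has already been parsed into (subfield code, value) pairs; the function name `name_key` and the “Smith, John” dates are invented for illustration.

```python
def name_key(subfields, codes="a"):
    """Build a lower-case name key from the chosen single-letter subfield codes."""
    parts = [value for code, value in subfields if code in codes]
    return " ".join(parts).rstrip(" ,").lower()

tolstoy_full = [("a", "Tolstoy, Leo,"), ("c", "graf,"), ("d", "1828-1910")]
tolstoy_short = [("a", "Tolstoy, Leo")]

# Invented example of two different people sharing a common name.
smith_one = [("a", "Smith, John,"), ("d", "1900-1980")]
smith_two = [("a", "Smith, John,"), ("d", "1950-")]

print(name_key(tolstoy_full, codes="abcd"))                 # full form, as in the examples above
print(name_key(tolstoy_full) == name_key(tolstoy_short))    # the two Tolstoy forms now agree
print(name_key(smith_one) == name_key(smith_two))           # but so do the two John Smiths
```

The last line is the cost: with only $a, authors who are distinguished solely by their dates collapse into one suggestion.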
So far I’ve only considered auto-complete on author and title fields. The Copac search forms have many fields and I’m not sure we have the facilities or compute power to inter-relate all the auto-complete suggestions so that the user only sees suggestions that make sense according to the fields the user has already filled in.
If we could inter-relate all the fields on our search forms we would probably know the search result before the user hit the search button. So what would be the point of having a search button anyway? That brings us back to the Google Instant search type of interface.
What should we do?
- We could just not bother trying to inter-relate the auto-complete suggestions and let users select mutually incompatible suggestions. (Which seems rather unhelpful.)
- We could not do auto-complete at all. (Again, this seems un-helpful at first sight, but may be better as the auto-complete seems to effect an increase in search precision which may not be useful against a database containing very variable quality data.)
- We could have just a single field on our search form. (Much easier to program, but not what our users tell us they want.)
- Just offer auto-complete on two or three fields and inter-relate them. (To make this work I think we’d have to make the suggestions as imprecise as we can without them being a waste of space.)
I hope the above ramblings make some sense. If anyone has thoughts on this issue we’d like to hear your views.
Hardware move
Written by Ashley : without comments
The hardware move has gone relatively smoothly today. We’ve had some configuration issues that prevented some Z39.50 users from pulling back records and another configuration problem that meant a small percentage of the records weren’t visible. That should all be fixed now, but if you see something else that looks like a problem, then please let us know.
The DNS entry for copac.ac.uk was changed at about 10am this morning. At 4pm we’re still seeing some usage on the old hardware. However, most usage started coming through to the new machine very soon after the DNS change.
The change over to the new hardware has involved a lot of preparation over many weeks. Now it’s done we can get back to re-engineering Copac… a new database backend and new search facilities for the users.
Copac at Interlend 2010.
Written by Lisa : without comments
Interlend is the annual conference of CILIP’s Forum for Interlending and Information Delivery. This year’s theme was ‘Meeting the Challenge: Co-operation & Collaboration’ and the conference was held at the Nottingham Belfry from Monday 28th to 30th June.
Copac coordinator Shirley Cousins and I (Lisa Jeskins) were asked to present one of the parallel sessions, ‘Copac: your union catalogue today and tomorrow’. We wanted to demonstrate some of the forthcoming Copac developments and get Inter-library loans (ILL) librarians to share their thoughts with us. We wanted to know how they felt about Copac and how we could help them to do their job.
We split the session into two: first, Shirley talked about some of the things we have been working on to improve Copac; and then I got the delegates to do some work!
Shirley gave the Interlend delegates an overview of our login feature that provides users with extra functionality (See post on: Copac’s new interface) and talked about Copac’s re-engineering project. (See post on: It’s official – Copac’s re-engineering) She even gave them a sneak preview of what a faceted Copac might look like. (You can see Shirley’s ppt here: http://www.slideshare.net/LisaJeskins/copac-your-union-catalogue-today-and-tomorrow)
I facilitated the discussion part of the session, and split the delegates up into four groups. I asked the groups to introduce themselves and explain what their role in interlending was. I then asked them to think about the following questions:
- How can we make your ILL work processes more efficient?
- e.g. extra ILL information on the holdings page for each library. If yes, what type of information?
- If we were to have a Librarian’s interface what should it include?
- e.g. option to search only those libraries that do document supply.
- In an ideal world, what do you wish Copac could do for you as an ILL librarian?
- e.g. link to your institution ILL page?
- You can think out of the box on this too, and we can always go away and discuss what is technically possible.
We wanted delegates to record their thoughts on flipchart paper and then feed back the main points of their group discussion to the room.
The parallel session was scheduled to run twice, and it was obvious right from the start that common themes were emerging. The 5 top issues were:
- ILL librarians want to easily see which libraries take part in document supply – who lends and who doesn’t. They would also appreciate it if it was easier for users to see which libraries lend their materials and which don’t. This would enable them to better manage their users’ expectations.
- ILL librarians want to see the British Library’s codes on Copac. These tell ILL librarians whether a library does document supply.
- ILL librarians do think that a link to their institutional ILL Page would be useful.
- ILL librarians would like to see more deduplication, but interestingly don’t necessarily want electronic and print items merged as this can cause problems if the e-version isn’t licensed for document supply.
- ILL librarians would like to see links to libraries’ document supply policies and prices should they differ from standard IDS charges.
Some interesting and original suggestions included providing a recommender function (something which we are currently looking into). We hadn’t realised that this could be useful for a stumped ILL librarian. One group added that Copac doesn’t currently recognise dashes in ISBNs that students have copied and pasted into the search box. Several groups also commented that they would like to see more libraries on Copac. We are going to investigate ways of taking these and other suggestions forward.
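On the ISBN point, one way such input could be handled is to strip dashes and spaces before searching. The sketch below is a hypothetical illustration only, not a description of any fix in Copac: the function name is made up, it checks only the overall shape of the number, and it does not verify the check digit. The ISBNs used are commonly quoted example numbers.

```python
import re

def normalise_isbn(text):
    """Return the ISBN with dashes and spaces removed, or None if it doesn't look like one."""
    # Remove whitespace, ASCII hyphens and the Unicode dash characters often
    # introduced by copying and pasting from web pages or documents.
    candidate = re.sub(r"[\s\-\u2010-\u2015]", "", text).upper()
    if re.fullmatch(r"\d{9}[\dX]", candidate) or re.fullmatch(r"\d{13}", candidate):
        return candidate
    return None

print(normalise_isbn("978-3-16-148410-0"))
print(normalise_isbn("0 306 40615 2"))
print(normalise_isbn("not an isbn"))
```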
The day was really useful for both of us. We came away with a better understanding of how we could improve Copac to help ILL librarians and we are going to explore these possibilities further. We also made some very useful contacts, who’d like to participate in Copac’s future development. If you would like to get involved or share with us your thoughts on how we can help you as ILL librarians, please contact us at copac@mimas.ac.uk.
Expanding Copac
Written by bethan : without comments
It’s exciting times for Copac – we’re working on improving the Copac user experience and we’ve been looking at other aspects of the development of Copac. A vital aspect of this development is expanding and enhancing Copac’s coverage, through the addition of new libraries and collections.
Copac’s original remit was to be the merged union catalogue of the holdings of the RLUK (then CURL) libraries. This was expanded in 2006 to include libraries added through the Copac Challenge Fund (http://www.rluk.ac.uk/node/59), which aimed to ‘facilitat[e] the discovery of the widest possible range of research materials’. The specific Challenge Fund activity has ended, but we are still committed to helping to expose rare and under-used materials, and are accepting informal applications from libraries wishing to be included in Copac. We ask applicants for some information about their collections, and ensure that they meet some basic technical criteria.
Our main focus is on supporting UK education and research, and we prioritise collections with large amounts of rare, scarce, and under-exposed material. We accept applications from all types of library, not just academic and research, and we will take specific collections from a library – eg while a public library’s lending collections may not fit in with Copac’s remit, they may have special collections that do.
Our Steering Committee is meeting in September, and they’ll be discussing strategies and priorities to ensure that Copac’s growth remains mission-focussed and sustainable. Much as we as a team would be delighted to add every library, that simply isn’t feasible in the short-term for a variety of reasons. Our steering committee will help us to prioritise our inclusions over the next two years. Longer-term, we’re going to develop a new strategy for Copac, and our future approach to content development will be high on the agenda.
Until then, we’re looking at ‘quick wins’ for helping users access more content held across the UK. For instance, the ‘your local library’ tool. We’ve been working with academic libraries whose collections are not on Copac to cross-search their collections through Z39.50. When a user of the library signs in to Copac, they get the opportunity to search their institution’s records alongside Copac.
If you’re interested in learning more, please email copac@manchester.ac.uk. We’d be pleased to hear from you.
Behind the Copac record 2: MODS and de-duplication
Written by bethan : without comments
We left the records having been rigorously checked for MARC consistency, and uploaded to the MARC21 database used for the RLUK cataloguing service. Next they are processed again, to be added to Copac.
One of the major differences between Copac and the MARC21 database is that the Copac records are not in MARC21. They’re in MODS XML, which is
an XML schema for a bibliographic element set that may be used for a variety of purposes, and particularly for library applications. It is a derivative of the MARC 21 bibliographic format (MAchine-Readable Cataloging) and as such includes a subset of MARC fields, using language-based tags rather than numeric ones.
Copac records are in MODS rather than MARC because Copac records are freely available for anyone to download, and use as they wish. The records in the MARC21 database are not – they remain the property of the creating library or data provider. We couldn’t offer MARC records on Copac without getting into all sorts of copyright issues. Using MODS also means we have all the interoperability benefits of using an XML format.
Before we add the records to Copac we check local data to ensure we’re making best use of available local holdings details, and converting local location codes correctly. Locations in MARC records will often be in a truncated or coded form, eg ‘MLIB’ for ‘Main Library’. We make sure that these will display in a format that will be meaningful to our users.
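The location conversion amounts to a lookup table per library. A minimal sketch, assuming a simple dictionary of codes: only ‘MLIB’ comes from the example above, and the other entry and the function name are hypothetical.

```python
# Only MLIB is taken from the text; SCOLL is a hypothetical extra entry.
LOCATION_NAMES = {
    "MLIB": "Main Library",
    "SCOLL": "Special Collections",
}

def display_location(code, table=LOCATION_NAMES):
    """Return a readable location name, falling back to the raw code if unknown."""
    return table.get(code.strip().upper(), code)

print(display_location("MLIB"))
print(display_location(" mlib "))
print(display_location("XYZ"))   # unknown codes are shown unchanged
```

Falling back to the raw code means an unmapped location is still displayed, even if not in a friendly form.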

It is also at this point that we do the de-duplication of records for Copac. Now, Copac de-duplication garners very mixed reactions: some users think we aren’t doing enough de-duplication; and occasionally we get told that we’re doing too much! We can’t ever hope to please everyone, but we’re aware that the process isn’t perfect, and we’ll be reviewing and updating deduplication during the reengineering. We will also be exploring FRBR work level deduplication.
As I’ve mentioned in an earlier blog post, we don’t de-duplicate anything published pre-1801. So what do we do for the post-1801 records?
As new records come in we do a quick and dirty match against the existing records using one or more of ISBN, ISSN, title key and date. This identifies potential matches which go through a range of other exact and partial field matches. The exact procedure will vary depending on the type of material, so journals (for instance) will go through a slightly different process than monographs.
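The quick first pass can be thought of as an index lookup on a few cheap keys, with the expensive field-by-field comparison reserved for the candidates it returns. The sketch below is an assumption-laden illustration rather than the Copac algorithm: the way the title key is built, the pairing of title key with date, and all of the names and sample records are invented.

```python
import re
from collections import defaultdict

def title_key(title):
    """Hypothetical title key: lower-case letters and digits only, truncated."""
    return "".join(re.findall(r"[a-z0-9]+", title.lower()))[:20]

def match_keys(record):
    """Return the cheap keys a record can be looked up by."""
    keys = set()
    if record.get("isbn"):
        keys.add(("isbn", record["isbn"].replace("-", "")))
    if record.get("issn"):
        keys.add(("issn", record["issn"].replace("-", "")))
    if record.get("title") and record.get("date"):
        keys.add(("title+date", title_key(record["title"]), record["date"]))
    return keys

class CandidateIndex:
    """Maps each key to the IDs of existing records that share it."""

    def __init__(self):
        self._by_key = defaultdict(set)

    def add(self, record_id, record):
        for key in match_keys(record):
            self._by_key[key].add(record_id)

    def candidates(self, record):
        """Return IDs of existing records sharing at least one key with record."""
        found = set()
        for key in match_keys(record):
            found |= self._by_key.get(key, set())
        return found

index = CandidateIndex()
index.add("r1", {"isbn": "978-3-16-148410-0", "title": "An example title", "date": "2001"})
index.add("r2", {"title": "An Example Title!", "date": "2001"})
index.add("r3", {"title": "Something else", "date": "1999"})

incoming = {"isbn": "9783161484100", "title": "An example title", "date": "2001"}
print(sorted(index.candidates(incoming)))
```

The candidates are only potential matches; the more careful exact and partial field comparisons would then decide which, if any, are really the same item.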
Records that are deemed to be the same are merged and for many fields unique data from each record is indexed. This provides for enhanced access to materials eg. a wider range of subject headings than would be present in any of the original records. The deduplication process can thus result in the creation of a single enhanced record containing holdings details for a range of contributing libraries.
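As a simplified picture of that merge, the sketch below combines matched records by keeping each distinct subject heading and holding once. The record layout, the field names and the choice to take the title from the first record are all assumptions made for the example, not the real merge rules.

```python
def merge_records(records):
    """Merge matched records, keeping each unique subject and holding once."""
    # Assumption for this sketch: the first record supplies the display title.
    merged = {"title": records[0]["title"], "subjects": [], "holdings": []}
    for record in records:
        for field in ("subjects", "holdings"):
            for value in record.get(field, []):
                if value not in merged[field]:
                    merged[field].append(value)
    return merged

# Invented sample records from two hypothetical contributors.
library_a = {"title": "An example title", "subjects": ["History"], "holdings": ["Library A"]}
library_b = {"title": "An example title", "subjects": ["History", "Politics"], "holdings": ["Library B"]}

merged = merge_records([library_a, library_b])
print(merged["subjects"])
print(merged["holdings"])
```

The merged record ends up with more subject headings than either original, which is where the enhanced access comes from.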
As we create the Copac records we also check for the availability of supplementary content information for each document, derived from BookData. We incorporate this into the Copac record further enhancing record content for both search and display, eg. a table of contents, abstract, reviews.
Because the deduplication process is fully automated it needs to err on the side of caution, otherwise some materials might disappear from view, subsumed into similar but unrelated works. This can mean records that appear to be self-evident duplicates to a searcher may be separated on Copac because of minor differences in the records. Changes made to solve one problem example could result in many other records being mis-consolidated. It’s a tricky balance.
However, there is another issue: the current load and deduplication is a relatively slow process. We have large amounts of data flowing onto the database every day and restricted time for dealing with updates. Consequently, where a library has been making significant local changes to their data, and we get a very large update (say 50,000 records), then this will be loaded straight onto Copac without going through the deduplication process.
This means that the load will, almost certainly, result in duplicate records. These will disappear gradually as they are pulled together by subsequent data loads, but it is this bypassing of the deduplication procedure in favour of timeliness, that results in many of the duplicate records visible on Copac. One of the aims of the reengineering is to streamline the dataload process, to avoid this update bottleneck, and improve overall duplicate consolidation levels.
So, that’s the Copac record, from receipt to display. We hope you’ve enjoyed this look behind the Copac records. Anything else you’d like to know about? Tell us in the comments!
Thanks to Shirley Cousins for the explanation of the de-duplication procedures




