To Google or not to Google [with update]

As Ashley has just posted, we’ve reinstated the links to Google Books that appear in the right-hand column of relevant records. Back in March we were pleased to be among the throng of those incorporating the new Google Books API. If Google’s mission is to ‘organize the world’s information and make it universally accessible and useful,’ who are we to argue? What self-respecting library service wouldn’t want to be a part of a project that promotes the Public Good?

Then something unusual happened — we got complaints. Not a great many, but still a vociferous few who questioned why Copac would give Google ‘personal data’ about them as users. Several of us in the team went back and forth over whether this was actually the case. My own opinion was that a) this was not ‘personal’ data, but usage data, and therefore not a threat to any individual’s privacy, and b) even if we were giving Google a little bit of something about our users and how they behaved, what does it matter if the trade-off is an improved system? Nonetheless, we went ahead and added that small script so that Google only spoke to the Copac server. No dice.

I was not all that surprised that our attempt at a workaround wasn’t effective (it would have been nice to have heard something back officially from Google on this front, but we’ll live). I am still wondering if it matters, though. Does it make sense that we ‘pay’ Google for this API by giving them this information about Copac users — their IP addresses and the ISBNs of books they look at? (Is this, in fact, what we’re doing? Paying them?) Isn’t all this just part of the collective move toward the greater Public Good that the entire Google Book Search project seems to be about?

Ultimately, right now, yes. This is the trade-off we’re willing to make. So we’ve reinstated the links, but for now we’ve also added an option under Preferences that allows users to de-googlise their searches. Turning off the feature for good would be reactionary to say the least (and perhaps, more to the point, in the political landscape in which Copac operates, *seen* as reactionary). Right now, if you’re in the ‘Resource Discovery’ business, a good relationship with the most ubiquitous and powerful search engine in the world is of no small importance.

Indeed, behind the scenes, our colleagues at RLUK have been working with Google on our behalf to sign an agreement which will mean that Google can spider Copac records. The National Archives has recently done this, and from what I hear anecdotally from people there, it’s already having a dramatic impact on their stats — thousands of users are discovering TNA records through Google searches, and so finding a resource they might not have known about before. We are hoping that users will have a similar experience with Copac, especially those looking for unique and rare items held in UK libraries that might not surface through any other means. We are eager to see what sort of impact a Google gateway to Copac will have, and we know it can only enhance the exposure of the collections. We’re also exploring this option for The Archives Hub.

Of course, this also means that Google gets to index more information about Copac web searches. David Smith’s article last week, “Google, 10 years on. Big Friendly Giant or Greedy Goliath?”, highlights some of the broader concerns about this. To what extent should we be concerned that a corporation is hoovering up information about our usage behaviour? I am always suspicious of overblown language surrounding technology, and Smith’s article does invoke quite a number of metaphors connoting a dark and grasping Google that we’d better start keeping an eye on: “Google’s tentacles are everywhere.”

But invocations of the ‘Death Star’ notwithstanding (!), I think we’re all learning to be a bit more cautious in our approach to Google. It may not be the Dark Lord, but it’s no ‘Big Friendly Giant’ either. For now, we’re pleased to be able to plug in Google’s free API (thank you, Google) and that Copac will soon be searchable via the engine. But nothing is entirely free, or done for entirely altruistic purposes — this is business after all. We just have to keep that in mind and talk constructively and openly about what we’re willing to pay.

[Updated to add: Likely much too late in the game, but I've just spent an hour or so listening to The Library 2.0 Gang's podcast with Frances Haugen, product manager for the Google Book Search API. Tim Spalding (LibraryThing) and Oren Beit-Arie (Ex Libris) were among those to pose some of the tougher questions surrounding the API, specifically the fact that it only works client-side and forces the user into the Google environment. According to Frances, future developments will include a server-side API, and an ultimate goal would be to move to a place where the API can be used to mash up data in new interface contexts. We'll certainly be watching this space :-)]

Google Book Search

We have re-enabled links to Google Book Search, again. If you haven’t already seen these links, they appear in the sidebar of the Full Record display underneath the menu and cover image. The link text will read as either “Google Full View”, “Google Preview” or “Google Book Search”, depending on the amount and type of information held by Google.

JavaScript embedded within the Full Record page connects to Google Book Search to determine whether Google holds information on the work. This enables us to show links to Google only when there is something useful to see when you follow the link. The downside is that Google will log the IP address of your computer, any cookies it has previously set on your browser and the ISBN of the work you are viewing — even if you don’t follow the link.
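To make that concrete, here is a minimal sketch of the kind of client-side check involved. The callback name, element id and ISBN are our own inventions for illustration; the jscmd=viewapi query string is Google’s published Dynamic Links interface.

```javascript
// Sketch of a client-side Google Book Search availability check. Loading
// this script is itself what sends your IP address, any Google cookies and
// the ISBN to Google -- before you ever click a link.
function gbsCallback(results) {
  var link = document.getElementById('gbs-link'); // hypothetical sidebar element
  for (var bibkey in results) {
    var info = results[bibkey];
    if (info.preview === 'full') {
      link.textContent = 'Google Full View';
    } else if (info.preview === 'partial') {
      link.textContent = 'Google Preview';
    } else {
      link.textContent = 'Google Book Search';
    }
    link.href = info.info_url;
    link.style.display = 'block'; // shown only once Google has answered
  }
}

var s = document.createElement('script');
s.src = 'https://books.google.com/books?jscmd=viewapi'
      + '&bibkeys=ISBN:0123456789&callback=gbsCallback'; // example ISBN
document.body.appendChild(s);
```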

Some of our users expressed concerns about being forced to link to Google, and so we changed the way the connection to Google was made. We had a small script on our server act as an intermediary between your computer and Google. That way your computer was only talking to our server, and all the connections to Google Book Search originated from our server. This worked for a short time until our script was blocked by Google — the message sent back to our script was that “your query looks similar to automated requests from a computer virus or spyware application”, which I can understand. We did try contacting people at Google to see if there was any way we could keep using our script. All we’ve had from Google is a deathly silence.
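For the curious, the intermediary boiled down to something like the following. This is a sketch of the idea in modern JavaScript, not the script we actually ran, and the URL parameters and port are illustrative.

```javascript
// Sketch of the intermediary: the browser asks our server, and our server
// asks Google, so the user's IP address and cookies never reach Google.
// Illustrative only -- not the script we actually ran.
const http = require('http');
const https = require('https');

http.createServer((req, res) => {
  const isbn = new URL(req.url, 'http://localhost').searchParams.get('isbn');
  const gbsUrl = 'https://books.google.com/books?jscmd=viewapi&bibkeys=ISBN:'
               + encodeURIComponent(isbn) + '&callback=gbsCallback';
  // Every request to Google now originates from this one server address,
  // which is exactly why Google mistook the traffic for automated abuse.
  https.get(gbsUrl, (gres) => {
    res.setHeader('Content-Type', 'text/javascript');
    gres.pipe(res);
  }).on('error', () => { res.statusCode = 502; res.end(); });
}).listen(8080);
```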

So we’ve reinstated the links to Google Book Search, and we now have a Preferences page which enables you to turn the links off if you don’t like Google being able to track what you do on Copac. You will need cookies enabled in your browser for the Preference settings to work. The link to the Preferences page appears in the sidebar menu on the search forms and the Full and Brief record displays.
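Under the hood the preference amounts to a cookie check before the Google lookup fires, along these lines (the cookie name here is hypothetical):

```javascript
// Skip the Google lookup entirely if the user has opted out via Preferences.
// The cookie name 'copac_gbs=off' is a hypothetical example.
function googleLinksEnabled() {
  return !/(?:^|;\s*)copac_gbs=off(?:;|$)/.test(document.cookie);
}

if (googleLinksEnabled()) {
  // ...inject the books.google.com script tag as sketched above...
}
```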

York Minster Library Loaded

York Minster Library is the largest cathedral library in the UK. It holds over 130,000 books with collections covering topics such as Theology, Art History, Stained Glass, History, Literature and Religious texts.

At the Minster, the collection is housed on the ground floor of the 13th-century Archbishops’ Chapel; it is openly available to the public, and registered users can borrow items.

The catalogue has been added as part of the Copac Challenge Fund.

City of London Libraries, Guildhall Library Loaded

We have loaded the catalogue of Guildhall Library, a collection which will be of great benefit to the research community. Guildhall Library is one of the five public libraries operated by the City of London Corporation and is one of the oldest public libraries in the country. The Library has one of the world’s most comprehensive collections of printed works on London history, and also outstanding resources on diverse subjects such as food and wine, gardening, law reports, English parliamentary papers, local history, marine history, clock and watch-making and archery.

The catalogue has been added as part of the Copac Challenge Fund.

Re-structuring the database

We are thinking of changing the database we use to run Copac. The current database software is very good at what it does, which is free-text searching, but it is proving problematic in other areas. For instance, it doesn’t know about Unicode or XML, which was okay some years ago when 7-bit ASCII was the norm, record displays were simple and there was less interest in inter-operating with other people and services. We have managed to shoehorn Unicode and XML into the database, though they don’t sit there easily and some pre- and/or post-processing is needed on the records.

The current database software also doesn’t cope well with the number and size of records we are throwing at it. For instance, the limit on record size is too small, and the number of records we have means the database has to be structured in a way that makes updating slower than we would like. We’d also like something with faster searching.

We haven’t decided what the replacement software is going to be, though we have been thinking about how a new Copac database might be structured…

De-duplication

Some people think we do too much de-duplication of our records, others think we do too little. So, we are thinking of having two levels of de-duplication: one at the FRBR work level, and another based broadly on edition and format. The two levels would be linked in a one-to-many relationship, i.e. a FRBR-level record would link to several edition-level records. An edition-level record would link back to one FRBR-level record, and also to the other edition-level records which link to the same FRBR record. This would result in a three-level hierarchy with the individual library records at the bottom, as sketched below. How this would translate into a user interface is yet to be decided.
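To make the shape of that hierarchy concrete, here is a rough sketch of the linking. The field names are ours for illustration only, not a settled schema.

```javascript
// Illustrative shape of the proposed three-level hierarchy.
// Field names are invented for this sketch, not a settled schema.
const work = {                        // FRBR work-level record
  workId: 'W1',
  title: 'Hamlet',
  editionIds: ['E1', 'E2']            // one-to-many: work -> editions
};

const edition = {                     // edition/format-level record
  editionId: 'E1',
  workId: 'W1',                       // back-link to its single work record
  isbn: '0123456789',                 // example ISBN
  siblingEditionIds: ['E2'],          // other editions of the same work
  libraryRecordIds: ['R1', 'R2']      // individual contributed records (bottom level)
};
```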

Holdings statements

We currently store local holdings information in with the main bibliographic record; doing otherwise in a non-relational database would have been troublesome. The plan is to keep the holdings out of the bibliographic records and only pull them in when they are needed.
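In practice that means the display layer would fetch holdings on demand, roughly along these lines (a toy sketch with in-memory stores; the names are hypothetical):

```javascript
// Holdings live in their own store, keyed by record id, and are pulled in
// only when a record is actually displayed. A toy sketch; names hypothetical.
const bibStore = new Map([
  ['E1', { editionId: 'E1', title: 'Hamlet' }]
]);
const holdingsStore = new Map([
  ['E1', [{ library: 'Example Library', shelfmark: 'XVII.D.4' }]]
]);

function fullRecordView(editionId) {
  const bib = bibStore.get(editionId);                 // bibliographic data only
  const holdings = holdingsStore.get(editionId) || []; // pulled in on demand
  return Object.assign({}, bib, { holdings: holdings });
}
```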

Updating

This should enable us to reduce the burden of the vast number of updates we have to perform. For instance, we sometimes receive updates from our contributing libraries of over 100,000 records, and updates of over a quarter of a million records are not unknown. Our larger contributors send updates of around twenty thousand records on a weekly basis. We now have over 50 contributing libraries, and that adds up to a lot of records every week that we need to push through the system.

Unfortunately for us, many of these updated records probably contain changes only to the local data and none to the bibliographic data. However, the current system means we have to delete each record from the database and then add it back in. If a record was part of a de-duplicated set, that delete and add results in the de-duplicated record being rebuilt twice, for probably no overall change to the bibliographic details.

So, the plan for a new system is that when a library updates a record we will immediately update our copy that stores the local data, and mark the FRBR-level and edition-level records it is part of for updating. The updating of these de-duplicated record sets will be done off-line, or during the small hours when the systems are less busy. If we can determine that an updated record had no changes to the bibliographic data, then there would be no need to update the de-duplicated sets at all.
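A sketch of that update path, assuming (our illustration, not a settled design) that a hash over the bibliographic fields is enough to detect a bib-level change:

```javascript
// Sketch of the planned update path: local data is applied immediately, and
// the expensive de-duplicated-set rebuild is queued only if the bibliographic
// fields actually changed. The names and the hashing choice are illustrative.
const crypto = require('crypto');

function bibHash(record) {
  // Hash only the bibliographic fields, ignoring local/holdings data.
  return crypto.createHash('sha1')
               .update(JSON.stringify(record.bib))
               .digest('hex');
}

function applyUpdate(stored, incoming, rebuildQueue) {
  const bibChanged = bibHash(stored) !== bibHash(incoming);
  stored.local = incoming.local;          // local data: applied straight away
  if (bibChanged) {
    stored.bib = incoming.bib;
    // Rebuilt once, off-line or in the small hours -- not delete-then-add twice.
    rebuildQueue.push({ editionId: stored.editionId, workId: stored.workId });
  }
}
```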

What now?

We think we know how we are going to do all the above and our next step is to produce a mock-up we can use to test our ideas…

Lifting the Copac Curtain

Since I joined the Copac team at Mimas earlier this year, it has struck me that there’s a certain facelessness to Copac. Copac just ‘is’, and the talented people working industriously behind it are largely invisible — which has made sense, of course. We have a hunch that Copac users don’t really care too much about how we’re tackling new API developments, or whether we’re Shibbolized or not.

But there’s another community out there, the one comprising professionals, librarians, academics, techies and geeks, the one that *does* get all a-twitter about bibliometrics, FRBR, Open Source, data-sharing, and Library 2.0. The one comprising People Like Us. And so we decided it was time to get out from behind the curtain and join in the conversation. We not only want to share what we’re up to behind the scenes, but also to be part of the larger conversation — sharing our knowledge, getting feedback, and learning so we can move Copac forward.

My colleague Ashley has just christened this here blog with a post on our new OAI-PMH interface (so please, harvest away and let us know how you get on), and we hope this will be the first of many posts about our developments — not just information, but speaking up about the challenges we’re tackling around development, including Web 2.0 and the ways in which Copac might support customisation or adaptive personalisation, and the potential contexts for reuse and sharing of Copac data. In addition, we’re working with EDINA on the JISC-funded D2D project, so over the next year we’ll be working with our colleagues in Edinburgh to join up Copac with Zetoc and SUNCAT, and to think about new ways in which this collective data, along with new tools, can better support the workflows of Scholarly Communication.

There’s going to be a lot to write about, and we know one of our main challenges will be keeping this blog up-to-date. Watch this space, and hopefully something new will appear on it every so often :-). Meanwhile, get to know your hosts — Ashley and Joy.

(p.s. we’ll be building up our blogroll over the next few weeks, but if you come across something in your blog-reader you think we’d find interesting, do leave us a comment and point us to it)

Harvesting records from Copac with OAI-PMH

We have recently implemented an OAI-PMH interface to Copac (access details can be found on our Copac Interfaces web page). It has passed all the tests of the OAI-PMH protocol validator hosted by the Open Archives Initiative, so I’m reasonably confident we have a usable service up and running. The niggling doubt I have is that the OAI Repository Explorer seems to have a problem parsing the MODS records we deliver — it has no problems with our DC records, so I’m hoping it is a problem at their end, not ours. If you know different then please let us know!

Several Sets are available through which you can harvest sub-sets of the Copac database. We currently have only four Sets: Music, Sounds, Images and Maps. If there are sub-sets of the Copac database that you think it would be useful to harvest and that are not in the list, let us know and we’ll see what we can do. (We are limited in which Sets we can offer by what is easily and efficiently searched for in the database.)

The Copac database contains almost 34 million records. We are a little worried about the performance hit our servers (and hence our users) might suffer should someone decide to try to harvest the whole database. Therefore, we currently insist that any ListRecords or ListIdentifiers request must specify a Set; if no Set is specified, a BadArgument error is returned.
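If you would like to try it, a harvesting loop looks roughly like the sketch below. The base URL is a placeholder (see the Copac Interfaces page for the real endpoint); the verb, set and resumptionToken mechanics are standard OAI-PMH.

```javascript
// Minimal OAI-PMH harvesting loop for a single Set. The base URL is a
// placeholder -- see the Copac Interfaces page for the real endpoint.
// Uses the built-in fetch (Node 18+); real code should parse the XML properly.
const BASE = 'https://copac.example.org/oai'; // placeholder, not the real URL

async function harvest(set) {
  let params = 'verb=ListRecords&metadataPrefix=oai_dc&set=' + set;
  while (params) {
    const res = await fetch(BASE + '?' + params);
    const xml = await res.text();
    console.log(xml.slice(0, 200)); // process the <record> elements here
    // Keep following the resumptionToken until the server stops issuing one.
    const m = xml.match(/<resumptionToken[^>]*>([^<]+)<\/resumptionToken>/);
    params = m ? 'verb=ListRecords&resumptionToken=' + encodeURIComponent(m[1]) : null;
  }
}

harvest('Music').catch(console.error);
```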

Unfortunately we do not maintain information about deleted records, so it will be necessary to periodically re-harvest records from us. I wish we could offer deletion information; however, the way we receive records from our contributing institutions, combined with the de-duplication process, makes it very difficult to track deletions and keep a stable record number. But such difficulties are probably a blog post for the future.