Atom and Shibboleth

The Search History and My References features of the Copac Beta Test Interface are stored in a database with an Atom Publishing Protocol (APP) Interface. The idea is to make the database open to use by other people and services and so enable re-purposing of the data.

Authentication poses a problem. We need to authenticate so that we can identify the user and show them their records and not someone else’s. We didn’t want people to have to register to use Copac and neither did we want to get into developing a mechanism to handle user registration, etc. So, we have used the JISC supported UK Federation (aka Shibboleth) Access Management system. This allows users to login to Copac using their own institutional username. Registering separately with Copac is not needed to gain access.

The downside is that Shibboleth is designed to work with web browsers. I don’t know the technicalities of it all, but a login with Shibboleth seems to involve multiple browser redirects, possibly a WAYF asking “Where are you From?” and a web page with a bunch of Javascript that the browser has to interpret that redirects the browser yet again. I’ve tried accessing the Shibboleth protected version of our APP Interface with some APP client software and none of it could get past the authentication. However, it is very hard to diagnose where the problems are.

I also tried the command line program “curl” to access the APP Interface. While it can handle the redirects and the username and password, I think it fails when it gets to the page with the Javascript. Which is fair enough: “curl” isn’t a web browser, it is just a program that retrieves urls.

So, can we make do without Shibboleth? Well we can, but the options are either not terribly secure or not practical. The options I can think of are:

  1. We put a token (eg a unique id) in the url. This effectively makes the user’s collection of records and search history public if the url is published.
  2. We put the token in a cookie. This is still insecure and subject to cookie hijacking, but is more private as the token isn’t in the url. Many high profile web sites seem to use such a cookie for authentication, and if they do, then I don’t see why we shouldn’t. However, I’m not sure how practical it is to get third party APP client software to send the cookie, unless the APP client was written as part of a web browser that already has the cookie.
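
To make option 2 a little more concrete, here is a minimal Python sketch of what a non-browser client would have to do to send such a token. It is only an illustration: the cookie name "copac-token" and the token value are hypothetical, and the real service is protected by Shibboleth rather than by a cookie like this. The request is built but not sent.

```python
from urllib.request import Request

# Feed address taken from this post; the cookie scheme below is hypothetical.
FEED_URL = "https://copac.ac.uk/atom/my-references/"


def build_feed_request(token, cookie_name="copac-token"):
    """Build a request that carries the token in a cookie, not in the url."""
    request = Request(FEED_URL)
    # The token travels in a header, so it does not appear in the url itself.
    request.add_header("Cookie", f"{cookie_name}={token}")
    request.add_header("Accept", "application/atom+xml")
    return request


request = build_feed_request("abc123")
print(request.full_url)
print(request.get_header("Cookie"))
```

The practical problem described above remains: a third party APP client would need some way to be given the cookie value in the first place.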

You can try accessing the Shibboleth protected APP server for yourself at the following url:

  • https://copac.ac.uk/atom/

If you’ve already used the Copac Beta then your Search History and My References collections can be found at the following urls in the form of Atom feeds:

  • https://copac.ac.uk/atom/saved-searches/
  • https://copac.ac.uk/atom/my-references/

Please let us know how you get on! I’ve tried the above urls with Firefox and Safari. Firefox gets through the authentication and displays the Atom feeds and Service Documents. Safari seems to put itself into an infinite loop whilst trying to display the feed (maybe this is something to do with the XML in our Atom feed?)

We’d be very interested to hear your thoughts on the above.

Copac Beta : new search urls

As the new Copac beta test interface is now storing users’ search history in a database we needed Copac search urls to be stateless (or RESTful.) If you look at the current Copac urls, you will notice as you navigate through a result set, just how much saved state is encoded in the url. There are references to the session ID and the number of your query within your session.

In the new scheme of things, that is all gone and I believe our search urls are now stateless — that is, all the information needed to display a search result is now encoded in the url. The CGI script serving the url does not have to go delving into a database to work out what to do.

I’ll attempt here to explain the new url scheme and hopefully you will see how it can be used as a machine to machine interface to Copac. I should point out though, that this is describing the beta version and things may change in the future.

So, to perform an author query against the Copac database, all you need is a url like this:

http://beta.copac.ac.uk/search?au=sutter

The above url will perform an author search for “sutter” and will display an HTML rendered page showing the first page of brief records. If you would like the results sorted, then you can add a “sort-order” element to the url as follows:

http://beta.copac.ac.uk/search?au=sutter&sort-order=ti

The above url will sort the query by the record title field. If the result set is too large to sort, then you will be redirected back to the same query without the sort-order.

If you want to view the first full record in a result set, then add an “rn” element to the url:

http://beta.copac.ac.uk/search?au=sutter&rn=1

Similarly, to view the second page of brief records:

http://beta.copac.ac.uk/search?au=sutter&page=2

All the above urls return an HTML display — not what you want for machine to machine communication. So, to get some programmer friendly XML you can add the “format” element to the url:

http://beta.copac.ac.uk/search?au=sutter&page=1&format=XML+-+MODS

The above url returns a page of MODS XML records. A page, by default, is 25 records. If you’d prefer more or fewer records in a page, then you can set the page size by sending a “Page-size” header with the HTTP request. And, so that you know how large the result set is, a “Result-set-size” header is returned with the HTTP response when a “format” is specified in the url.
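
As a rough sketch of how a client might put this together, the Python below builds a search request using the url elements and the “Page-size” header described above. The function name and its defaults are illustrative only, and since this is a beta interface the parameters may change. The request is constructed but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://beta.copac.ac.uk/search"


def build_search_request(author, page=1, record_format="XML - MODS",
                         sort_order=None, page_size=None):
    """Build (but do not send) a Copac beta author search request."""
    params = {"au": author, "page": page, "format": record_format}
    if sort_order:
        params["sort-order"] = sort_order
    # urlencode turns the spaces in "XML - MODS" into "+" signs.
    request = Request(BASE_URL + "?" + urlencode(params))
    if page_size is not None:
        # The page size is sent as an HTTP header, not as a url element.
        request.add_header("Page-size", str(page_size))
    return request


request = build_search_request("sutter", page_size=50)
print(request.full_url)
print(request.get_header("Page-size"))
```

To actually run the query you could pass the request to urllib.request.urlopen and read the “Result-set-size” header from the response headers.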

You can, of course, specify a “sort-order” along with a “format”. You’ll be able to discover the various query fields, sort and format options by delving around the user interface and performing a few queries. I’m not going to document them here and now as it is all still beta and they may change before we go live.

Search results as an Atom feed?

Here are a few questions for you. Would it be useful to be able to get your Copac search results as an Atom feed? If so, would it help in aggregating Copac searches with results from other services? Would it make writing widgets for, say, iGoogle or Netvibes, easier? Would you like Copac urls to be RESTful? (I hope so, as they will be before long.)

Yesterday I was thinking about the different search result formats we provide and I was wondering if Atom might be useful. Then a conversation I had this morning with some colleagues made me think an Atom format could be very useful in the areas outlined above. However, I don’t have experience of implementing widgets or working with feeds, so I thought I’d ask here. Any thoughts, anyone?

Search history & a stateless interface

One of the things I’d like to do for Copac is to re-write the code behind the web based user interface. The current architecture was designed to work with a Z39.50 server and I now consider it to be too complex. This makes it hard to debug when things go wrong and the complexity of it means that things do go wrong.

So, I’d like to move the interface over to a REST based stateless interface that talks directly to the database without going through our Z39.50 interface. This should decrease the time to produce a response after a user hits the search button and should be more reliable.

What I wasn’t too sure about, until now, was how we would incorporate Copac’s Search History feature into a stateless, REST based, interface. The answer came to me during the small hours this morning. We can put the searches into the same Atom Publishing Protocol (APP) repository that we plan to use for the Marked List. (The Search History and Marked List would be separate collections within the repository and so wouldn’t be mixed up together.)

The advantages of this are: the user can have an Atom feed of their searches, they can tag and annotate their searches and generally manipulate their search history by deleting and editing entries through APP client software. We might also be able to include searches from other services. I think such a search history would work for any REST based service. So if we can move other Mimas services, such as Zetoc and the Archives Hub over to a REST based interface, then a user could potentially have, in one place, an archive of all the searches they have performed over a number of different services.
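
To give a feel for what a saved search might look like in such a repository, here is a small Python sketch that wraps a stateless search url in an Atom entry. The choice of elements, the use of the search url as the entry id, and the example url and title are all assumptions made for illustration, not a description of how Copac stores its entries.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
# Serialise Atom elements without a namespace prefix.
ET.register_namespace("", ATOM_NS)


def search_entry(search_url, title, updated):
    """Return an Atom entry (as a string) describing one saved search."""
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    # Assumption: the stateless search url doubles as the entry identifier.
    ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = search_url
    ET.SubElement(entry, f"{{{ATOM_NS}}}updated").text = updated
    ET.SubElement(entry, f"{{{ATOM_NS}}}link",
                  {"rel": "alternate", "href": search_url})
    return ET.tostring(entry, encoding="unicode")


print(search_entry("http://example.org/search?au=sutter&sort-order=ti",
                   "Author: sutter", "2008-11-01T09:30:00Z"))
```

Because the entry only needs a url, the same shape would work for any service whose searches can be expressed as stateless urls.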

An enhanced Marked List

As part of the D2D work we are enhancing the functionality of the “Marked List” feature in Copac. The Marked List allows you to save records from your search session, for downloading or emailing to yourself in a variety of formats. One of the drawbacks to the Marked List is that it is linked to your search session. That means that when you come back to Copac tomorrow, the records you saved today will have gone.

So, one enhancement is to make your List of saved records permanent, so when you come back next week, everything you saved last week is still there. The downside to this is that you will need to login so that we know who you are and which are your records. If you don’t want to login to use Copac, then you will still be able to, you just won’t get the facility of a permanent Marked List.

The current plan is to provide an API to the Marked List and it seems most sensible to use the Atom Publishing Protocol (APP). One of the nice side effects of using APP is that you’ll get an Atom feed of the records you’ve saved, plus you’ll be able to manage your collection of records with a suitable APP client outside of the Copac web site. Your Marked List will be private to you, though we will look at adding an option to publish your List to make it public.

The fly in the ointment of all this might be Shibboleth (the UK Academic access management mechanism.) It isn’t clear to me if an Atom feed is going to work in a Shibbolized environment. I hope to have something to test soon and I’ll keep you informed…

Bookmarking Copac records

In a previous post, “Persistent identifiers for Copac records”, I said that we would soon be adding links from our Full record pages to bookmarking sites such as Delicious. Well, we have now added the links to Delicious!

We hope you find this functionality useful. Let us know if there are other such sites you think we should be linking to.

Persistent identifiers for Copac records

If you know the record number of a Copac record, there is now a simple url that will return you the record in MODS XML format. The urls take the following form: http://copac.ac.uk/crn/<record-number>. For instance, the work “China tide : the revealing story of the Hong Kong exodus to Canada” has a Copac Record Number of 72008715609 and can be linked to with the url http://copac.ac.uk/crn/72008715609.
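
In code, building such a link is a one-liner. The Python sketch below adds a small sanity check that assumes record numbers are made up of digits only, as in the example above; that assumption is mine and may not hold for every record.

```python
def record_url(record_number):
    """Return the persistent url for a Copac Record Number."""
    number = str(record_number)
    # Assumption: record numbers consist of digits only, as in the example.
    if not number.isdigit():
        raise ValueError(f"unexpected record number: {record_number!r}")
    return f"http://copac.ac.uk/crn/{number}"


print(record_url(72008715609))
```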

Over the next few weeks we’ll be looking at adding these links to the Copac Full record pages and also introducing links to bookmarking web sites such as del.icio.us.

Institute of Education reload

Last week we started re-loading the Institute of Education Library records. Due to the number of records involved it will take a little while to complete the operation and as of today approximately half of the records are visible in the Copac interfaces. The rest of the records should be available this time next week.

The re-load was required to enable better access to live circulation information from the Institute’s Library Management System.

Search Solutions 2008

On Tuesday last I attended “Search Solutions 2008”, organised by the BCS-IRSG. To quote from the event programme, “Search Solutions is a special one-day event dedicated to the latest innovations in information search and retrieval.” The format of the day was a series of short talks, 11 in all, each about 20 minutes in length with the chance for questions from the audience after each talk.

One of the themes through the day was the linguistic analysis of texts such as blog posts and web pages. In other words, deducing the correct meaning of a word like Georgia: is it referring to someone called Georgia, the country that used to be part of the USSR, or the US state? As all the speakers were from commercial companies no-one was giving their secrets away, but approaches mentioned ranged from Bayesian analysis to a team of 50 linguistic experts.

Another theme was how social networking can help users find what they’re looking for. User recommendations and tagging were both cited frequently in this regard. Elias Pampalk from last.fm gave a very interesting talk on how tagging is being used on last.fm. They have made it very easy for users to tag. Adding a tag usually involves no typing — just a couple of mouse clicks to select either a tag you’ve used before or a tag someone else has used for that item. There is also incentive for people to tag at last.fm as it can help you discover new music and connect you to people with similar tastes. They seem to have gotten it right as they are collecting over 2.5 million tags per month.

At the end of his talk, Elias mentioned that last.fm had an open API, which I had never realised before. This got me wondering if we could provide links from Copac to last.fm. This perhaps isn’t as strange an idea as it may first seem. Copac doesn’t hold records for just books, we have many records in the database for CD and sheet music. It might be kind of neat to provide a link from those records to last.fm’s page about the artist or album and perhaps pull in images as well? Something to think about when we can find a bit of spare time.

Overall it was a very interesting day with many thought provoking talks and I’d happily attend a similar day next year.

Handling XML errors

I’ve just installed some updated software that should increase the reliability of the web service. Unfortunately, while I was installing the software people using the service will have seen error messages in place of our records. The disruption should only have lasted a minute or two and everything should be working now.

The update allows us to better cope with errors in the records. In the past an XML error in one record in a page of results was causing users to see a “500 Internal Server Error” page rather than their records. Things are now better, though not perfect. We still cannot display the record with the errors, but the rest of the records are displayed and there should be no more Internal Server Error pages because of bad XML. Records with errors will now show as follows in the brief display:

An un-displayable record in the Brief display.
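
The general idea can be sketched in a few lines of Python: parse each record on its own, and if one fails, show a placeholder instead of letting the whole page fail. This is an illustration of the approach and not our actual code; the record structure and the placeholder text are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical placeholder text for a record that cannot be parsed.
PLACEHOLDER = "[This record cannot be displayed]"


def render_brief_records(raw_records):
    """Return one display line per record, tolerating bad XML."""
    lines = []
    for raw in raw_records:
        try:
            record = ET.fromstring(raw)
        except ET.ParseError:
            # One bad record no longer spoils the rest of the page.
            lines.append(PLACEHOLDER)
            continue
        lines.append(record.findtext("title", default="(untitled)"))
    return lines


records = [
    "<record><title>China tide</title></record>",
    # A line-break in the middle of an entity makes the XML invalid.
    "<record><title>Fish &am\np; chips</title></record>",
]
for line in render_brief_records(records):
    print(line)
```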

As I mentioned in a previous post, our database software does not natively support XML and it is occasionally inserting line-breaks where it shouldn’t, such as in the middle of an XML Entity! Our next task is to modify our line breaking algorithm (so that the database doesn’t need to do it itself) and correct the affected records.