Behind the Copac record

Written by bethan : with 2 comments

We’re going to be talking quite a lot about the Copac reengineering, including the move to FRBRise Copac, and in order for you to have some idea of how this is going to change what we do, you need to know what we do now.  So here’s a brief background on the life of a Copac record.

Records are sent to us by the contributing institutions, usually in MARC exchange format, which looks like this:

An unprocessed MARC exchange file
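
For anyone curious why the raw file is so hard on the eyes: a MARC exchange record (the ISO 2709 structure) is a 24-character leader, then a directory of 12-character entries (tag, field length, start offset), then the field data, with control characters as separators. What follows is a minimal sketch of reading that structure, not the programs our programmers wrote; the function names `parse_marc_record` and `build_marc_record` are made up for illustration, and the builder exists only so the example has something to parse.

```python
# Control characters defined by the MARC exchange (ISO 2709) structure.
FIELD_END = b"\x1e"    # terminates the directory and every field
RECORD_END = b"\x1d"   # terminates the whole record
SUBFIELD = b"\x1f"     # precedes each subfield code


def parse_marc_record(raw: bytes) -> list[tuple[str, bytes]]:
    """Split one exchange record into (tag, field data) pairs."""
    leader = raw[:24]
    base = int(leader[12:17].decode("ascii"))   # offset where field data begins
    directory = raw[24:base - 1]                # drop the directory terminator
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12].decode("ascii")
        tag, length, start = entry[:3], int(entry[3:7]), int(entry[7:12])
        data = raw[base + start:base + start + length]
        fields.append((tag, data.rstrip(FIELD_END)))
    return fields


def build_marc_record(fields: list[tuple[str, bytes]]) -> bytes:
    """Hypothetical helper: assemble a tiny record for the demo below."""
    directory, body = b"", b""
    for tag, data in fields:
        data += FIELD_END
        directory += f"{tag}{len(data):04d}{len(body):05d}".encode("ascii")
        body += data
    directory += FIELD_END
    base = 24 + len(directory)
    total = base + len(body) + 1                # +1 for the record terminator
    leader = f"{total:05d}nam  22{base:05d}   4500".encode("ascii")
    return leader + directory + body + RECORD_END


demo_fields = [
    ("001", b"92765"),
    ("100", b"1 " + SUBFIELD + b"aDaille, Jean," + SUBFIELD + b"d1594-1670."),
]
demo_record = build_marc_record(demo_fields)
for tag, data in parse_marc_record(demo_record):
    print(tag, data.replace(SUBFIELD, b"$").decode("ascii"))
```

Printing the subfield delimiter as `$` mirrors the way fields are usually written out for human readers.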

We then run this through programs created by our wonderful programmers (and about which I know very, very little, except that they’re fantastic and save both my eyes and my sanity), which create records that look like this:

A processed MARC file

This is much easier on the eye, which is fortunate, as this is the stage where I use the warning file (also generated by the program) to look through and track down any possible errors. This is mainly done when loading a new library – once a library has been loaded, we just keep an eye on their updates to identify any changes, or new issues that arise.

For instance, the warning file might say ‘WARNING: LONG NAME IN 100 MAY NOT BE PERSONAL NAME  REC 92765’.  I would then look up that record, and check whether the long name in the 100 is, in fact, a personal name, or if it is a corporate name and needs to be in a 110.

This program has been evolving ever since the start of Copac, and it’s now able to handle most changes with very little need for human intervention.  Therefore, when I see ‘WARNING: 700 '1 $aDaille, Jean, 1594-1670.' CHANGED TO '1 $aDaille, Jean,$d1594-1670.'’, I know that I don’t need to do anything – that change is correct.

Some warnings do need looking at in more depth.  If I see a warning that says something along the lines of ‘WARNING: NO 245 IN REC 76932.  240 CONVERTED TO 245’, then I will look at the original record and the altered record to see if that change is correct.
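
To give a flavour of the kind of rules involved, here is a small sketch of three checks like the ones described above. It is emphatically not the real program (which I couldn’t reproduce if I tried): the regular expression, the `LONG_NAME_LIMIT` threshold and the function names are all assumptions for illustration, and for simplicity each record is a dictionary holding one field per tag, whereas real records can repeat tags such as 700.

```python
import re

# Dates such as "1594-1670." sitting at the end of a name, after a comma.
DATES_IN_NAME = re.compile(r",\s+(\d{4}-(?:\d{4})?\.?)$")
LONG_NAME_LIMIT = 60  # hypothetical threshold, not the value the real program uses


def move_dates_to_subfield_d(field: str) -> str:
    """Put trailing dates into their own $d subfield if they are not there yet."""
    if "$d" in field:
        return field
    return DATES_IN_NAME.sub(r",$d\1", field)


def check_record(rec_no: int, fields: dict[str, str]) -> tuple[dict[str, str], list[str]]:
    """Return a corrected copy of the record plus the warnings raised."""
    fixed = dict(fields)
    warnings = []
    for tag in ("100", "700"):
        if tag in fixed:
            new = move_dates_to_subfield_d(fixed[tag])
            if new != fixed[tag]:
                warnings.append(f"WARNING: {tag} '{fixed[tag]}' CHANGED TO '{new}'")
                fixed[tag] = new
    if len(fixed.get("100", "")) > LONG_NAME_LIMIT:
        warnings.append(
            f"WARNING: LONG NAME IN 100 MAY NOT BE PERSONAL NAME  REC {rec_no}"
        )
    if "245" not in fixed and "240" in fixed:
        fixed["245"] = fixed.pop("240")
        warnings.append(f"WARNING: NO 245 IN REC {rec_no}.  240 CONVERTED TO 245")
    return fixed, warnings


fixed, warnings = check_record(
    76932, {"700": "1 $aDaille, Jean, 1594-1670.", "240": "10$aWorks"}
)
for line in warnings:
    print(line)
```

The first kind of warning can be trusted as it stands; the last kind is the one that still sends me back to compare the original and altered records.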

At this stage we’ll also check if there are any generic fields being used in a local way, that notes are in the correct notes fields, and that all records have holdings information.  Note that we’re largely not in a position to assess the quality of the data in the fields – purely that the right sort of data is in the right fields.  We wouldn’t, for example, correct typos in authors’ names or incorrect publication dates.  Doing so would require making judgements, and would make the whole process simply unmanageable.  More importantly, the data on Copac belongs to the contributing libraries, and so they are the ones who would need to make any corrections to the content.  Thus, in general, the only changes we would make are to the MARC structure (or occasionally to the encoding of special characters), to try to ensure standardised data for record sharing and for building Copac.  The data content of the fields we leave exactly as it is.

Once we’re satisfied that all this is correct, the data is loaded onto the RLUK shared cataloguing database in MARC21 format, where it is available for use by RLUK members and customers.  Back in the Copac office, it’s time for another round of processing, before the data is loaded onto Copac.  More on that next time!

Written by bethan

January 26th, 2010 at 2:02 pm

Posted in Database

Database update

Written by Ashley : without comments

We’ve had a recurrence of the problem I reported a month ago and so last night we installed an update to the database software we use. I’m told the update contains fixes relevant to the problems we have been experiencing, so here’s hoping it brings some increased reliability with it.

Please accept our apologies if you experienced some disruption last night while I was updating the software.

Written by Ashley

November 13th, 2009 at 10:29 am

Posted in Database

Yesterday’s loss of service

Written by Ashley : with one comment

I thought I’d write a note about why we lost the Copac service for a couple of hours yesterday.

The short of it is that our database software hung when it tried to read a corrupted file in which it keeps track of sessions. The result was that everyone’s search process hung, and so frustrated users kept re-trying their searches, which created more hung sessions until the system was full of hung processes with no CPU or memory left. Once we had deleted the corrupted file, everything was okay.

The long version goes something like this… From what I remember, things started going pear-shaped a little before noon when the machine running the service started becoming unresponsive. A quick look at the output of top showed we had far more search sessions running than normal and that the system was almost out of swap space.

It wasn’t clear why this was happening and because the system was running out of swap it was very difficult to diagnose the problem. It was difficult to run programs from the command line as, more often than not, they immediately died with the message “out of memory.” I did manage to shut down the web server in an effort to lighten the load and stop more search sessions being created. It was proving almost impossible to kill off the existing search sessions. In Unix a “kill -9” on a process should immediately stop the process and release its memory back to the system. But yesterday a “kill -9” was having no effect on some processes and those that we did manage to kill were being listed as “defunct” and still seemed to be holding onto memory. In the end we just thought it would be best to re-boot the system and hope that it would solve whatever the problem was.

It took ages for the system to shut itself down – presumably because the shutdown procedures weren’t working with no memory to work in. Anyway, it did finally reboot and within minutes of the system coming up it became overloaded with search sessions and ran out of memory again.

We immediately shut down the web server again. However, search sessions were still being created by people using Z39.50 and so we had to edit the system configuration files to stop inetd spawning more Z39.50 search sessions. Editing inetd.conf didn’t prove to be the trivial task it should have been, but we did get it done eventually. We then tried killing off the 500 or so search sessions that were hogging the system — and that proved difficult too. Many of the processes refused to die. So, after sitting staring at the screen for about 15 minutes, unable to run programs because there was no memory and wondering what on earth do we do now, the system recovered itself. The killed off processes did finally die, memory was released and we could do stuff again!

A bit of investigation showed that the search processes weren’t getting very far into their initialisation procedure before hanging or going into an infinite loop. I used the Solaris truss program to see what files the search process was reading and what system calls it was making. Truss showed that the process was going off into cloud cuckoo land just after reading a file the database software uses to track sessions. So I deleted that file and everything started working again! The file got re-created next time a search process ran — presumably the file had become corrupted.

Written by Ashley

October 15th, 2009 at 1:54 pm

Posted in Database,Interfaces

Issues searching other library catalogues

Written by Ashley : without comments

Some of you may have noticed that there is now a facility on the Copac search forms to search your local library catalogue as well as Copac. You’ll only see this option if you have logged into Copac and are from a supported library.

The searching of the local library catalogues and Copac is performed using the Z39.50 search protocol. Due to differences in local configurations, the queries we send to Copac and to the various library catalogues have to be configured very differently.

When we built the Copac Z39.50 server, we tried to make it flexible in the type of query it would accept within the limitations imposed upon us by the database software we use. Our database software was made for keyword searching of full text resources. As such it is good at adjacency searches, but you can’t tell it you want to search for a word at the start of a field.

Systems built around relational databases tend to be the complete opposite in functionality. They often aren’t good at keyword searching, but find it very easy to find words at the start of a field.

The result of which is that we make our default search a keyword search, while some other systems default to searching for query terms at the start of a field. Hence if we send the exact same search to Copac and a library catalogue we can get a very different result from the two systems. To try and get a consistent result we have to tweak the query sent to the library so that it performs a search as near as possible to that performed by Copac. Working out how to tweak (or transform or mangle) the queries is a black art and we are still experimenting.
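
As an illustration of what such a tweak can look like, the Z39.50 Bib-1 attribute set has a position attribute (type 3) where 1 means “first in field” and 3 means “any position in field”, alongside use attributes such as 4 for title. The sketch below builds a query string in the Prefix Query Format used by the YAZ toolkit. That we would express queries in this notation is an assumption made for the example, and `to_pqf` is a made-up helper, not our actual transformation code.

```python
def to_pqf(terms: str, use: int = 4, anywhere: bool = True) -> str:
    """Build a PQF query with an explicit Bib-1 position attribute.

    use=4 is the Bib-1 title attribute; position 3 asks for a match anywhere
    in the field (keyword-like), position 1 for a match at the start of it.
    """
    position = 3 if anywhere else 1
    cleaned = " ".join(terms.replace('"', " ").split())  # no quote handling here
    return f'@attr 1={use} @attr 3={position} "{cleaned}"'


# Ask a catalogue that defaults to start-of-field matching for a keyword search.
print(to_pqf("pride prejudice"))
# The same terms as a start-of-field title search.
print(to_pqf("pride prejudice", anywhere=False))
```

Spelling the attribute out does not guarantee that every server will honour it, which is part of why this remains a black art.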

Stop word lists are also an issue. Some library systems like to fail your search if you search for a stop word. Better systems just ignore stop words in queries and perform the search using the remaining terms. The effect is that searching for “Pride and prejudice” fails on some systems because “and” is stop worded. To get around this we have to remove stop words from queries. But we first need to know what the stop words are.
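
Stripping stop words before sending the query on is simple enough once the list is known. Here is a minimal sketch; the stop word set is a stand-in, since each library system has its own list, and `remove_stop_words` is a hypothetical name rather than our production code.

```python
def remove_stop_words(query: str, stop_words: set[str]) -> str:
    """Drop stop words from a query, comparing case-insensitively."""
    kept = [word for word in query.split() if word.lower() not in stop_words]
    # If every term is a stop word, send the query unchanged rather than nothing.
    return " ".join(kept) if kept else query


# Stand-in list; a real one would have to come from the target library system.
EXAMPLE_STOP_WORDS = {"a", "an", "and", "of", "the"}

print(remove_stop_words("Pride and prejudice", EXAMPLE_STOP_WORDS))
```

Falling back to the original query when nothing is left is a design choice: an empty search is certain to fail, whereas the unmodified one might not.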

The result is that the search of other library systems is not yet as good as it could be, though it will get better over time as we discover what works best with the various library systems that are out there.

Written by Ashley

August 14th, 2009 at 4:16 pm

Logging in to Copac: some tips

Written by bethan : without comments

Now that you have the option to log-in to Copac to use the personalisation features, here are some tips to make logging in as easy as possible.

Typekey/Typepad:  if you have a Typekey or Typepad account, and were wondering where your login option was, worry no longer!  From the drop-down list of organisations on the login page, you need to choose ‘JISC project: SDSS (TypeKey Bridge)’.  It’s not immediately obvious, but it is the correct login option for any TypeKey users.

Navigating the list:  the list of organisations is very long, and weighted heavily towards ‘U’.  To navigate it more easily, you can jump straight to any letter by typing it on your keyboard.  You may find it even easier to enter a keyword search in the search box.  This will work for partial words as well – entering ‘bris’ will give you the options of the City of Bristol College and the University of Bristol.

Remembering your selection:  once you have found your organisation, there are options to have your selection remembered, either for that session (the default) or for a week.  You can also choose ‘do not remember’, which is especially useful if you are on a public computer.

Please contact us if you experience any problems with logging in to Copac.

Written by bethan

August 12th, 2009 at 10:25 am

Posted in Interfaces

New Copac interface

Written by bethan : with 3 comments

It’s finally here!  After months of very hard work from the Copac team, and lots of really useful input from users on the Beta trials, the new Copac interface is now live.

We have streamlined the Copac interface, and you can still search and export records without logging in to Copac. This is ideal if you want to do a quick search, and don’t need any of the additional functionality.  Users who choose not to login will still be able to use the new functionality of exporting records directly to EndNote and Zotero, and will see book and journal tables-of-contents, where available.

You now also have the option to login to Copac.  This is not compulsory, and you only need to login if you want to take advantage of the full range of new personalisation features.   These have been developed to help you to get the most out of Copac, and to assist in your workflows.

‘Search History’ records all of your searches, and includes a date/time stamp.  This allows you to keep track of your searches, and to easily re-run any search with a single click.

‘My References’ allows you to manage your marked records, and create an annotated online bibliography.

You can annotate and tag all of your searches and references.  There is no limit to how you can use this functionality:  see my post from March for some suggestions about how you might use tags and annotations.  We would love to hear how you are using them – please get in touch if you would like to share your experiences and ideas.

Users from some institutions will now have the option to see their local catalogue results appearing alongside the Copac results.  We are harvesting information from the institutions’ Z39.50 servers, and using this to create a merged results set.  If you are interested in your institution being a part of this, please get in touch.

Some people have expressed concern that the need to login means that Copac is going to be restricted to members of UK academic institutions only.  This is not the case.  We are committed to keeping Copac freely-accessible to all.  Login is required for the new features to function:  we need to be able to uniquely identify you in order to record your search history and references; and we need to know which (if any) institution you are from to show you local results.  We have tried to make logging in as easy as possible.  For members of UK academic institutions, this means that you can use your institution’s central username/password, or your ATHENS details.  For our users who aren’t members of a UK academic institution, you can create a login from an identity provider: ProtectNetwork and TypePad.  These providers enable you to create a secure identity, which you can use to manage access to many internet sites.

We are very grateful to everyone who has taken the time to give us feedback on the recent Beta trials.  But we can never get enough feedback!  We’d love to hear what you think about the new Copac interface:  you can email us; speak to us on twitter; or leave comments here.

Written by bethan

August 10th, 2009 at 4:07 pm

Copac Beta can search your library too

Written by Ashley : with one comment

One of the new features we are trialling in the new Copac Beta is the searching of your local institution’s library catalogue alongside Copac. To do this we need to know which Institution you are from and whether or not your Institutional library catalogue can be searched with the Z39.50 protocol.

To identify where you are from, we are using information given to us during the login process. When you login, your Institution gives us various pieces of information about you, including something called a scoped affiliation. For someone logging in from, say, the University of Manchester, the scoped affiliation might be something like “student@manchester.ac.uk”

Once we know where you are from, we search a database of Institutional Z39.50 servers to see if your Institution’s library is searchable. If it is we can present the extra options on the search forms, and indeed, fire off any queries to your library catalogue.
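
In outline, the lookup amounts to taking the part of the scoped affiliation after the “@” and using it as a key into the table of servers. The sketch below is only illustrative: the `Z3950Target` structure, the `example.ac.uk` entry and the function names are invented for the example, and the real database is built from harvested IESR records rather than a hard-coded dictionary.

```python
from typing import NamedTuple, Optional


class Z3950Target(NamedTuple):
    host: str
    port: int
    database: str


# Invented entry for illustration; 210 is the standard Z39.50 port.
TARGETS = {
    "example.ac.uk": Z3950Target("z3950.example.ac.uk", 210, "catalogue"),
}


def scope_of(scoped_affiliation: str) -> str:
    """Return the institution part of e.g. 'student@manchester.ac.uk'."""
    affiliation, sep, scope = scoped_affiliation.rpartition("@")
    if not (affiliation and sep and scope):
        raise ValueError(f"not a scoped affiliation: {scoped_affiliation!r}")
    return scope.lower()


def local_target(
    scoped_affiliation: str, targets: dict[str, Z3950Target] = TARGETS
) -> Optional[Z3950Target]:
    """Find the user's library server, or None if it is not searchable."""
    return targets.get(scope_of(scoped_affiliation))


print(scope_of("student@manchester.ac.uk"))
print(local_target("staff@example.ac.uk"))
```

When the lookup returns nothing, the search forms simply stay as they are, with no local catalogue option.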

Our database of Z39.50 servers is created from records harvested from the IESR. So, if you’d like your Institution’s catalogue available through Copac, make sure it is included in the IESR by talking to the nice people there.

Many thanks to everyone who tried the Beta interface early on and discovered that this feature mostly wasn’t working. You enabled us to identify some bugs and get the service working.

Written by Ashley

July 9th, 2009 at 12:08 pm

Posted in D2D,Interfaces

Notes on (Re)Modelling the Library Domain (JISC Workshop).

Written by joy : without comments

A couple of weeks ago, I attended JISC’s Modelling the Library Domain Workshop. I was asked to facilitate some sessions at the workshop, which was an interesting but slightly (let’s say) ‘hectic’ experience. Despite this, I found the day very positive. We were dealing with potentially contentious issues, but I noted real consensus around some key points. The ‘death of the OPAC’ was declared and no blood was shed as a result. Instead I largely heard murmured assent. As a community, we might have finally faced a critical juncture, and there were certainly lessons to be learned in terms of considering the future of services such as Copac, which, as a web search service, would count in the Library Domain Model as a national JISC service ‘Channel.’

In the morning, we were asked to interrogate what have been characterised as the three ‘realms’ of the Library Domain: Corporation, Channels, and Clients. (For more explanation of this model, see the TILE project report on the Library Domain Model). My groups were responsible for picking apart the ‘Channel’ realm definition:

The Channel: a means of delivering knowledge assets to Clients, not necessarily restricted to the holdings or the client base of any particular Corporation, Channels within this model range from local OPACs to national JISC services and ‘webscale’ services such as Amazon and Google Scholar. Operators of channel services will typically require corporate processes (e.g. a library managing its collection, an online book store managing its stock). However, there may be an increasing tendency towards separation, channels relying on the corporate services of others and vice versa (e.g. a library exposing its records to channels such as Google or Liblime, a bookshop outsourcing some of its channel services to the Amazon marketplace).

In subsequent discussion, we came up with the following key points:

  • This definition of ‘channel’ was too library-centric. We need to work on ‘decentring’ our perspective in this regard.
  • We will see an increasing uncoupling of channels from content. We won’t be pointing users to content/data but rather data/content will be pushed to users via a plethora of alternative channels
  • Users will increasingly expect this type of content delivery. Some of these channels we can predict (VLEs, Google, etc) and others we cannot. We need to learn to live with that uncertainty (for now, at least).
  • There will be an increasing number of ‘mashed’ channels – a recombining of data from different channels into new bespoke/2.0 interfaces.
  • The lines between the realms are already blurring, with users becoming corporations and channels….etc., etc.
  • We need more fundamental rethinking of the OPAC as the primary delivery channel for library data. It is simply one channel, serving specific use-cases and business processes within the library domain.
  • Control. This was a big one. In this environment libraries increasingly devolve control of the channels via which their ‘clients’ access the data. What are the risks and opportunities to be explored around this decreasing level of control? What related business cases already exist, and what new business models need to evolve?
  • How are our current ‘traditional’ channels actually being used? How many times are librarians re-inventing the wheel when it comes to creating the channels of e-resource or subject specialist resource pages? We need to understand this in broad scale.
  • Do we understand the ways in which the channels libraries currently control and create might add value in expected and unexpected ways? There was a general sense that we know very little in this regard.

There’s a lot more to say about the day’s proceedings, but the above points give a pretty good glimpse into the general tenor of the day. I’m now interested to see what use JISC intends to make of these outputs. The ‘what next?’ question now hangs rather heavily.

Written by joy

July 3rd, 2009 at 3:04 pm

It’s Official — Copac’s Re-engineering

Written by joy : with one comment

We’ve been hinting for a while now about significant changes being imminent for Copac, and I am now pleased to announce that we’ve had official word that we have secured JISC funding to overhaul the Copac service over the next year.

The major aim for this work is to improve the Copac user experience.  In the short term this will mean improving the quality of the search results.  More broadly, this will mean providing more options for personalising and reusing Copac records.

We’re going to be undertaking the work in two phases.  We’re calling Phase 1 the ‘iCue Project’ (stands for ‘Improving the Copac User Experience’).  This work will be focused on investigating and proposing pragmatic solutions that improve the Copac infrastructure and end-user experience, and we’re going to be partnering with Mark Van Harmelen of Personal Learning Environments Ltd (PLE) in this work (Mark is also involved in the JISC TILE project, so we believe there’s a lot of fruitful overlap there, especially around leveraging the potential of circulation data a la Huddersfield).  The second phase is really about doing the work — re-engineering Copac in line with the specifications defined in the iCue Project.

We see this work tackling three key areas for Copac:

(i) Interface revision: We’ll be redesigning Copac’s user interface, focusing on areas of usability and navigability of search results. We are aware that the sheer size of our database and our current system means that searches can return large, unstructured result sets that do not facilitate users finding what they need.  Addressing this is a major priority.  We’ll be building on the CERLIM usability report we recently commissioned (more on that in another post) and also drawing on the expertise of OPAC 2.0 specialists such as Dave Pattern.  We’ll also be working consistently with users (librarian users and researcher users) to monitor and assess how we’re doing.

(ii) Database Restructuring: A more usable user interface is going to critically rely on a suitable restructuring of Copac’s database. Particularly, we are centrally interested in FRBR (Functional Requirements for Bibliographic Records) as a starting point for a new database structure. We anticipate that whatever we learn as we undertake this piece of work will be of interest to the broader community, and plan to disseminate this knowledge, and update the community via this blog.

(iii)  De-duplication: The restructuring implies further de-duplication of Copac’s contents, and so we’re also developing a de-duplication algorithm.  Ideally we would like to see the FRBR levels of work, expression, manifestation and (deduplicated) item being supported, or a pragmatic version of the same.
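
To make the de-duplication idea concrete, one common approach is to reduce each record to a normalised match key and group records that share it. The sketch below shows only that general technique, under invented field names (`id`, `title`, `author`, `year`); it is not the algorithm we are developing, which has yet to be specified, and matching on a key like this sits roughly at the FRBR manifestation level at best.

```python
import re
import unicodedata
from collections import defaultdict


def normalise(text: str) -> str:
    """Lower-case, strip accents and collapse punctuation to single spaces."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def match_key(record: dict[str, str]) -> tuple[str, str, str]:
    """Build a key from fields that ought to agree across duplicate records."""
    return (normalise(record["title"]), normalise(record["author"]), record["year"])


def group_duplicates(records: list[dict[str, str]]) -> dict[tuple[str, str, str], list[str]]:
    """Map each match key to the ids of the records that share it."""
    groups: dict[tuple[str, str, str], list[str]] = defaultdict(list)
    for record in records:
        groups[match_key(record)].append(record["id"])
    return dict(groups)


sample = [
    {"id": "lib1-001", "title": "Pride and Prejudice.", "author": "Austen, Jane", "year": "1813"},
    {"id": "lib2-417", "title": "Pride and prejudice", "author": "AUSTEN, Jane.", "year": "1813"},
    {"id": "lib1-002", "title": "Emma", "author": "Austen, Jane", "year": "1816"},
]
for key, ids in group_duplicates(sample).items():
    print(key, ids)
```

An exact-key approach like this misses records with typos or differing cataloguing practice, which is exactly the sort of thing a more considered algorithm would need to handle.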

The end user benefits:
1. Searches are faster and more effective (Copac database is more responsive and robust; users are presented with a more dramatically de-duplicated results view)
2.  Search-related tasks are easier to perform (i.e. the flexibility of this system will support the narrowing/broadening of searches, faceted searching, personalising/sharing content)
3.  Access to more collections (Copac database is able to hold more content and continue to grow)

So there we have it.  It’s going to be quite a year for the Copac team.  If you have any questions, comments or suggestions you’d like us to take on board, do leave a comment here or email us.  (Not that this will be the only time we ask!) We can also be chatted to via twitter @Copac.

Written by joy

May 1st, 2009 at 3:25 pm

Posted in Uncategorized

Catalogues as Communities? (Some thoughts on Libraries of the Future)

Written by joy : with one comment

At last week’s Libraries of the Future debate, Ken Chad challenged the presenters (and the audience) over the failure of libraries to aggregate and share their data.  I am very familiar with this battle-cry from Ken.  In the year+ that I’ve been managing Copac, he’s (good-naturedly) put me on the spot several times on this very issue.  Why isn’t Copac (or the UK HE/FE library community) learning from Amazon, and responding to users’ new expectations for personalisation and adaptive systems?

Of course, this is a critically important question, and one that is at the heart of the JISC TILE project, which Ken co-directs (I actually sit on the Reference Group). Ken’s  related argument is that the public sector business model (or lack thereof) is perhaps fatally flawed, and that we are probably doomed in this regard; private sector is winning already on the personalisation front, so instead of pouring public money into resource discovery ‘services’ we should instead, perhaps, let the market decide.  I am not going to address the issue of business models here – although this is a weighty issue requiring debate – but I want to come back to this issue of personalisation, 2.0, and the OPAC as a potential ‘architecture for participation.’

I fundamentally agree with the TILE project premise (borrowed from Lorcan Dempsey) that the library domain needs to be redefined as a set of processes required for people to interact with ‘stuff’.  We need to ask ourselves if the OPAC itself is a relic, an outmoded understanding of ‘public access’ or (social) interaction with digital content. As we do this, we’re creating heady visions where catalogue items or works can be enhanced with user-generated content, becoming ‘social objects’ that bring knowledge communities together.  ‘Access’ becomes less important than facilitating ‘use’ (or reuse) and the Discovery to Delivery paradigm is turned on its head.

It’s the ‘context’ of the OPAC as a site for participation that I am interested in questioning.  Can we simply ‘borrow’ from the successful models of Amazon or LibraryThing? Is the OPAC the ‘place’ or context that can best facilitate participative communities?

This might depend on how we’re defining participation, and as Owen Stephens has suggested (via Twitter chats) what the value of that participation is to the user.  In terms of Copac’s ‘My References’ live beta, we’ve implemented ‘tagging with a twist,’ where tagging is based on user search terms and saved under ‘Search History’.  The value here is fairly self-evident – this is a way for users to organise their own ‘stuff’. The tagging facility, too, can be used to self-organise, and as Tim Spalding suggested way back in 2007, this is also why tagging works for LibraryThing (and why it doesn’t work for Amazon). Tagging works well when people tag “their” stuff, but it fails when they’re asked to do it to “someone else’s” stuff. You can’t get your customers to organize your products, unless you give them a very good incentive.

But does this count as ‘community’ participation?  Right now we don’t provide the option for tags to be shared, though this is being seriously considered along the lines of a recommender function: ‘users who saved this item also saved…’, which seems to be a logical next step, and potentially complementary to Dave’s recommender work. However, I’m much less convinced about whether HE/FE library users would want to explicitly share items through identity profiles, as at LibraryThing.  Would the LibraryThing community model translate to the models that university and college libraries might want in order to support the semantically dense and complex communities for learning, teaching and research?
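
As a rough idea of how a ‘users who saved this item also saved’ function could work, the sketch below counts how often other items appear in the saved lists of users who saved a given item. It is a bare co-occurrence count under assumed data (a mapping from anonymous user ids to sets of record ids), not a description of anything we have built, and the name `also_saved` is invented.

```python
from collections import Counter


def also_saved(item: str, saved_by_user: dict[str, set[str]], top_n: int = 3) -> list[str]:
    """Return the items most often saved alongside `item`, most frequent first."""
    counts: Counter[str] = Counter()
    for items in saved_by_user.values():
        if item in items:
            counts.update(other for other in items if other != item)
    return [other for other, _ in counts.most_common(top_n)]


# Invented example data: user ids mapped to the record ids they have saved.
saved = {
    "u1": {"rec-a", "rec-b"},
    "u2": {"rec-a", "rec-b", "rec-c"},
    "u3": {"rec-c"},
}
print(also_saved("rec-a", saved))
```

Note that nothing here requires users to share an identity profile: the recommendation is drawn from aggregate behaviour, which may suit HE/FE users better than explicit sharing.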

One of the challenges for a participatory OPAC 2.0 (or any cross-domain information discovery tool) will be the tackling of user context, and specifically the semantic context(s) in which that user is operating.  Semantic harvesting and text mining projects such as the Intute Repository Search have pinpointed the challenge of ‘ontological drift’ between disciplines and levels (terms and concepts having shifted meanings across disciplinary boundaries).  As we move into this new terrain of Library 2.0 this drift will likely become all the more evident.  Is the OPAC context too broad to facilitate the type of semantic precision to enable meaningful contribution and community-building?

Perhaps attention data, that ‘user DNA,’ will provide us with new ways to tackle the challenge.  There is risk involved, but some potential ‘quick wins’ that are of clear benefit.  Dave’s blog posts over the last week suggest that the value here might be in discovering people ‘like me’ who share the same research interests and keep borrowing books like the ones I borrow (although, if I am an academic researcher, that person might also be ‘The Competition’ — so there are degrees of risk to account for here — and this is just the tip of the ice-berg in terms of considering the cultural politics of academia and education).  Certainly the immediate value or ‘impact of serendipity’ is that it gives users new routes into content, new paths of discovery based on patterns of usage.

But what many of us find so compelling about the circulation data work is that it surfaces latent networks not just of books, but of people.  These are potential knowledge communities or what Wenger might call Communities of Practice (CoP).  Whether the OPAC can help nurture and strengthen those CoPs is another matter. Crowds, even wise ones, are not necessarily Communities of Practice.

Reimagining the library means reimagining (or discarding) the concept of the catalogue.  This might also mean rethinking the OPAC as a context for community interaction.

—————–

[Related 'watch this space' footnote: We've already garnered some great feedback on the 'My References' beta we currently have up -- over 80 user-surveys completed (and a good proportion of those from non-librarian users).  This feedback has been invaluable.  Of course, before we embark on too many more 2.0 developments, Copac needs to be fit-for-purpose.  In the next year we are re-engineering Copac, moving to new hardware, restructuring the database,  improving the speed and search precision, and developing additional (much-needed) de-duplication algorithms.  We're also going to be undertaking a complete  overhaul of the interface (and I'm pleased to say that Dave Pattern is going to be assisting us in this aspect). In addition, as Mimas is collaborating on the TILE project through Copac, we're going to look at how we can exploit what Dave's done with the Huddersfield circulation data (and hopefully help bring other libraries on board).]

Written by joy

April 7th, 2009 at 10:14 am