<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Re-structuring the database</title>
	<atom:link href="http://copac.ac.uk/development-blog/2008/08/re-structuring-the-database/feed/" rel="self" type="application/rss+xml" />
	<link>http://copac.ac.uk/development-blog/2008/08/re-structuring-the-database/</link>
	<description>What's happening behind the scenes at Copac</description>
	<lastBuildDate>Fri, 27 Jan 2012 19:08:08 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
	<item>
		<title>By: Getting to know the Copac libraries at Copac Developments</title>
		<link>http://copac.ac.uk/development-blog/2008/08/re-structuring-the-database/comment-page-1/#comment-96</link>
		<dc:creator>Getting to know the Copac libraries at Copac Developments</dc:creator>
		<pubDate>Tue, 27 Jan 2009 10:05:50 +0000</pubDate>
		<guid isPermaLink="false">http://copac.ac.uk/development-blog/?p=56#comment-96</guid>
		<description>[...] same manner as the Nielsen BookData cover images. This may have to wait until the new database (see this post of Ashley&#8217;s for what else the new database might hold), but it&#8217;s a feature that we are very [...]</description>
		<content:encoded><![CDATA[<p>[...] same manner as the Nielsen BookData cover images. This may have to wait until the new database (see this post of Ashley&#8217;s for what else the new database might hold), but it&#8217;s a feature that we are very [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Handling XML errors at Copac Developments</title>
		<link>http://copac.ac.uk/development-blog/2008/08/re-structuring-the-database/comment-page-1/#comment-19</link>
		<dc:creator>Handling XML errors at Copac Developments</dc:creator>
		<pubDate>Thu, 18 Sep 2008 10:13:56 +0000</pubDate>
		<guid isPermaLink="false">http://copac.ac.uk/development-blog/?p=56#comment-19</guid>
		<description>[...] I mentioned in a previous post our database software does not natively support XML and it is occasionally inserting line-breaks [...]</description>
		<content:encoded><![CDATA[<p>[...] I mentioned in a previous post our database software does not natively support XML and it is occasionally inserting line-breaks [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ashley</title>
		<link>http://copac.ac.uk/development-blog/2008/08/re-structuring-the-database/comment-page-1/#comment-11</link>
		<dc:creator>Ashley</dc:creator>
		<pubDate>Fri, 29 Aug 2008 10:25:08 +0000</pubDate>
		<guid isPermaLink="false">http://copac.ac.uk/development-blog/?p=56#comment-11</guid>
		<description>Hi Baptiste, thanks for your comment and yes, we are hoping to include relevance ranking in a new system. It will be interesting to see how well relevance ranking will work on our records. Some of our records are really very minimal, having little more than an author and a title (and sometimes not even an author), while others have extensive tables of contents, notes and subject information.

The de-duplication should help as it brings together full and minimal records. Users will find the minimal records by virtue of us being able to associate them with the metadata from the fuller records. So a poorer record may be pulled up the rankings because it is part of a de-duplicated group of records.

I think it unlikely we can do much with our really poor records (those with just a title and little else). It may not be possible to allocate them to a FRBR work-level record, and they may just have to sit alone and unloved in the database. :-)</description>
		<content:encoded><![CDATA[<p>Hi Baptiste, thanks for your comment and yes, we are hoping to include relevance ranking in a new system. It will be interesting to see how well relevance ranking will work on our records. Some of our records are really very minimal, having little more than an author and a title (and sometimes not even an author), while others have extensive tables of contents, notes and subject information.</p>
<p>The de-duplication should help as it brings together full and minimal records. Users will find the minimal records by virtue of us being able to associate them with the metadata from the fuller records. So a poorer record may be pulled up the rankings because it is part of a de-duplicated group of records.</p>
<p>I think it unlikely we can do much with our really poor records (those with just a title and little else). It may not be possible to allocate them to a FRBR work-level record, and they may just have to sit alone and unloved in the database. <img src='http://copac.ac.uk/development-blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Shirley Cousins</title>
		<link>http://copac.ac.uk/development-blog/2008/08/re-structuring-the-database/comment-page-1/#comment-10</link>
		<dc:creator>Shirley Cousins</dc:creator>
		<pubDate>Thu, 28 Aug 2008 08:47:09 +0000</pubDate>
		<guid isPermaLink="false">http://copac.ac.uk/development-blog/?p=56#comment-10</guid>
		<description>In response to Hugh Taylor re. the deduplication issue. 

The current deduplication process is actually already a two-part match process so the move to having both work level and manifestation level records won&#039;t actually add to the deduplication burden. 

With the deduplication process it would be too slow to match every incoming record with the entire database, so we do an initial quick and dirty match to create a set of potential duplicate records. The incoming record is then matched against the potential duplicates using a second, much more detailed, duplicate checking process to confirm or reject the initial match. 

We intend to refine the initial match procedure to bring together records for all the different manifestations of a work. This work-level record set will then act as the set of potential duplicates for the second match stage, where we merge just those records for a specific manifestation of the work, e.g. the records relating to one particular edition.

So the match procedures will need to be changed but the overall work involved in matching should be much the same as now. 

However, I think Hugh is right in suggesting that work-level record matching may result in a higher level of misconsolidation of records. The broader the match, the more likely it is that we will bring records together inappropriately. There are still some &#039;patient researchers&#039; out there who get in touch to report errors in the records, something we are planning to make easier to do. And we will put in place ways of dealing with reported consolidation errors to stop them happening again. But as Hugh says - &#039;A tricky one&#039; and the diversity of the records is always going to make it difficult to resolve.</description>
		<content:encoded><![CDATA[<p>In response to Hugh Taylor re. the deduplication issue. </p>
<p>The current deduplication process is actually already a two-part match process so the move to having both work level and manifestation level records won&#8217;t actually add to the deduplication burden. </p>
<p>With the deduplication process it would be too slow to match every incoming record with the entire database, so we do an initial quick and dirty match to create a set of potential duplicate records. The incoming record is then matched against the potential duplicates using a second, much more detailed, duplicate checking process to confirm or reject the initial match. </p>
<p>We intend to refine the initial match procedure to bring together records for all the different manifestations of a work. This work-level record set will then act as the set of potential duplicates for the second match stage, where we merge just those records for a specific manifestation of the work, e.g. the records relating to one particular edition.</p>
<p>So the match procedures will need to be changed but the overall work involved in matching should be much the same as now. </p>
<p>However, I think Hugh is right in suggesting that work-level record matching may result in a higher level of misconsolidation of records. The broader the match, the more likely it is that we will bring records together inappropriately. There are still some &#8216;patient researchers&#8217; out there who get in touch to report errors in the records, something we are planning to make easier to do. And we will put in place ways of dealing with reported consolidation errors to stop them happening again. But as Hugh says &#8211; &#8216;A tricky one&#8217; and the diversity of the records is always going to make it difficult to resolve.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Baptiste Manson</title>
		<link>http://copac.ac.uk/development-blog/2008/08/re-structuring-the-database/comment-page-1/#comment-7</link>
		<dc:creator>Baptiste Manson</dc:creator>
		<pubDate>Wed, 27 Aug 2008 13:46:04 +0000</pubDate>
		<guid isPermaLink="false">http://copac.ac.uk/development-blog/?p=56#comment-7</guid>
		<description>What about relevance ranking of the results?
Is it planned?
Cheers, and keep up the good work on Copac.</description>
		<content:encoded><![CDATA[<p>What about relevance ranking of the results?<br />
Is it planned?<br />
Cheers, and keep up the good work on Copac.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Hugh Taylor</title>
		<link>http://copac.ac.uk/development-blog/2008/08/re-structuring-the-database/comment-page-1/#comment-5</link>
		<dc:creator>Hugh Taylor</dc:creator>
		<pubDate>Wed, 27 Aug 2008 07:23:36 +0000</pubDate>
		<guid isPermaLink="false">http://copac.ac.uk/development-blog/?p=56#comment-5</guid>
		<description>Clearly restructuring the database should provide a better environment for handling holdings data as well as ease some of the pain associated with updates. There&#039;s a limited amount contributors can do to help with the latter, of course - you can already tell from MARC 005s whether a bib or holdings record has been updated since you last received it from us, but in the current database model you have no choice but to process both anyway. Anything that reduces that wasted &quot;effort&quot; is clearly a good thing.

I&#039;m slightly less clear re deduplication. I don&#039;t think having a work level and an expression/manifestation level - interesting as that is in itself - quite addresses what I perceive to be the gist of the comments you refer to at the beginning of that para. The real issue here seems to be about the matching algorithm(s) - whether this is the same as that. By associating all these expression/manifestation records not only with each other but with a &quot;work&quot; you are adding to the deduplication. The fact that the patient researcher will eventually be able to determine that some of these records are inappropriately linked may be a bonus (if there are any patient researchers left out there...), but it means there are likely to be more such knots to untangle in the first place. A tricky one which I don&#039;t envy you trying to tackle!</description>
		<content:encoded><![CDATA[<p>Clearly restructuring the database should provide a better environment for handling holdings data as well as ease some of the pain associated with updates. There&#8217;s a limited amount contributors can do to help with the latter, of course &#8211; you can already tell from MARC 005s whether a bib or holdings record has been updated since you last received it from us, but in the current database model you have no choice but to process both anyway. Anything that reduces that wasted &#8220;effort&#8221; is clearly a good thing.</p>
<p>I&#8217;m slightly less clear re deduplication. I don&#8217;t think having a work level and an expression/manifestation level &#8211; interesting as that is in itself &#8211; quite addresses what I perceive to be the gist of the comments you refer to at the beginning of that para. The real issue here seems to be about the matching algorithm(s) &#8211; whether this is the same as that. By associating all these expression/manifestation records not only with each other but with a &#8220;work&#8221; you are adding to the deduplication. The fact that the patient researcher will eventually be able to determine that some of these records are inappropriately linked may be a bonus (if there are any patient researchers left out there&#8230;), but it means there are likely to be more such knots to untangle in the first place. A tricky one which I don&#8217;t envy you trying to tackle!</p>
]]></content:encoded>
	</item>
</channel>
</rss>

