1

Topic: Standards for uniquely identifying biological records

As biological records start to move more and more freely around the biological recording network, the problem of duplicate records will only become worse. The root of the problem is that when a record moves from one database/software implementation to another, these implementations normally have different ways of uniquely identifying records and so, instead of honouring the unique identifier that a record comes with, they assign a new one. This increases the chances that the two records will, in the future, be considered as different biological records rather than the same biological record with two different keys – especially if one, or both, of them is edited or otherwise modified in some way.

If different biological recording system providers and implementers could agree on a standard for referencing biological records, the worst of these problems could be avoided going forward. One way that suggests itself is by using the Open Software Foundation’s (OSF) standard for Universally Unique Identifiers (UUIDs) as implemented, for example, by Microsoft’s Globally Unique Identifiers (GUIDs).

As a community we need to start taking a proactive approach to developing technical solutions to the problem of duplicate records. The NBN (through involvement in the NBN Gateway, Indicia and iRecord), JNCC (through involvement in Recorder 6 development) and BRC (through involvement in iRecord) are key players in the development of UK biodiversity informatics systems and standards. Is there a debate within and/or between these organisations around the possibility of standardising references?
   
Whether UUIDs or another standard for assigning unique identifiers were used, the crucial thing would be that all biological recording software providers should implement it. If the influential organisations in the UK biodiversity informatics development community effectively promoted a standard to developers and users of biological recording software, then conformance to the standard would come to be regarded as a ‘selling point’ and would be more likely to be implemented.

Technically, a standardised reference needn’t be difficult to implement. In the first instance, it could simply sit alongside the reference systems currently used by existing software. The key thing is that it would be honoured when records move between systems.
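To make the "sits alongside" idea concrete, here is a minimal sketch in Python, assuming a version 4 UUID is used as the shared reference. The class names, the `record_uuid` field and the `LOCAL-` key format are all hypothetical and not taken from any existing recording package. The importing system still issues its own local key, but it looks up the incoming UUID first, so a record that arrives twice is not stored twice.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Record:
    taxon: str
    grid_ref: str
    date: str
    # Hypothetical field: the shared identifier that travels with the record.
    # A version 4 (random) UUID is assigned once, when the record is created.
    record_uuid: str = field(default_factory=lambda: str(uuid.uuid4()))


class LocalDatabase:
    """Keeps its own local keys, but honours the UUID a record arrives with."""

    def __init__(self, prefix):
        self.prefix = prefix
        self._next = 1
        self.by_local_key = {}
        self.local_key_for_uuid = {}

    def import_record(self, record):
        # If this UUID has been seen before, reuse the existing local key
        # instead of creating a second copy of the same record.
        existing = self.local_key_for_uuid.get(record.record_uuid)
        if existing is not None:
            self.by_local_key[existing] = record
            return existing
        # Otherwise issue a local key as usual and remember the pairing.
        local_key = f"{self.prefix}{self._next:06d}"
        self._next += 1
        self.by_local_key[local_key] = record
        self.local_key_for_uuid[record.record_uuid] = local_key
        return local_key


db = LocalDatabase("LOCAL-")
sighting = Record("Passer montanus", "SD4578", "2013-05-04")
first_key = db.import_record(sighting)
second_key = db.import_record(sighting)  # same record arriving by another route
```

In this sketch both imports return the same local key and the database holds one record. What should happen to the stored copy when a known UUID arrives again (overwrite, ignore, flag for review) is a policy decision that the sketch does not try to settle; it simply overwrites.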

Richard Burkmar
Biodiversity Project Officer
Field Studies Council

2

Re: Standards for uniquely identifying biological records

An excellent idea. Charles Copp, in the NBN Standards on which Recorder 2000 was based, introduced two concepts which are worth preserving in some way. Firstly, that every record should have a unique identifier which it takes with it wherever it goes, and secondly, that each record should have a Custodian, with only the Custodian having the right to change the record.

Having a unique identifier which is not system based will achieve the first objective. 

The second objective is more difficult to achieve, because users like to change records they import, if for no other reason than to fit the dictionaries and structures of their system. However, this isn’t a big issue as long as the altered record is not then passed on to other systems. One of the simplest ideas I have seen to overcome this is to have an ownership or custodianship token attached to each record. The system which holds the token can pass the record on, give consent for others to pass the record on, and transfer the token. Systems not holding the token can use the record and modify it for their own use, but cannot pass it on without the token holder’s permission, and even then only without the token. The advantage of the token concept, as with your GUID approach, is that it doesn’t require the maintenance of external records and each system can implement it in any way it likes. All that is required is a way of transferring the token in situations where this is required.
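The token rules described above could be sketched roughly as follows. This is only an illustration in Python: `System`, `HeldRecord`, `grant_consent` and `pass_on` are hypothetical names, and the sketch assumes that a system which transfers the token away keeps its copy of the record but loses the right to pass it on.

```python
from dataclasses import dataclass


@dataclass
class HeldRecord:
    record_uuid: str
    data: dict
    holds_token: bool = False   # True only for the current custodian
    may_pass_on: bool = False   # consent granted by the token holder


class System:
    def __init__(self, name):
        self.name = name
        self.records = {}

    def create(self, record_uuid, data):
        # The originating system starts out holding the token.
        self.records[record_uuid] = HeldRecord(record_uuid, dict(data), holds_token=True)

    def grant_consent(self, other, record_uuid):
        # Only the token holder can let another system pass the record on.
        if not self.records[record_uuid].holds_token:
            raise PermissionError(f"{self.name} does not hold the token")
        other.records[record_uuid].may_pass_on = True

    def pass_on(self, other, record_uuid, with_token=False):
        held = self.records[record_uuid]
        if not (held.holds_token or held.may_pass_on):
            raise PermissionError(f"{self.name} may not pass this record on")
        if with_token and not held.holds_token:
            raise PermissionError(f"{self.name} has no token to transfer")
        other.records[record_uuid] = HeldRecord(
            record_uuid, dict(held.data), holds_token=with_token
        )
        if with_token:
            # Assumption: transferring the token gives up custodianship here.
            held.holds_token = False


a, b, c = System("A"), System("B"), System("C")
a.create("rec-1", {"taxon": "Passer montanus"})
a.pass_on(b, "rec-1")          # B can use the record but not redistribute it
a.grant_consent(b, "rec-1")    # A, as token holder, allows B to pass it on
b.pass_on(c, "rec-1")          # allowed, but the token stays with A
```

Nothing in the sketch relies on an external registry: each system only tracks the state of the records it holds, which is the property being claimed for the token idea.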

Mike Weideli

3

Re: Standards for uniquely identifying biological records

Surely the eventual aim is to have all records available via the NBN so that the exchange of data isn't necessary. Everyone can then get to the latest data by downloading it temporarily from the Gateway as they need it.

4

Re: Standards for uniquely identifying biological records

David, whatever the eventual aim, the reality is that data move around all over the place via many different channels and often pass through a number of different systems - some of which attach their own type of unique identifier in place of any the record came with. And the reality is that this causes data duplication on the NBN Gateway and many biological record information systems (e.g. LRC databases). I don't think that we can afford to close our eyes to the problem and say it will all be okay once all biological records can only be accessed and used via the NBN Gateway. Even assuming that this is a desirable goal, by the time we reach that point, the problem of duplicate records would be huge.

Mike, do you know of anything happening in the biodiversity informatics sector to advance these ideas?

Rich

Richard Burkmar
Biodiversity Project Officer
Field Studies Council

5

Re: Standards for uniquely identifying biological records

I'm not sure how well something like this would actually work in practice.

If I record a tree sparrow and effectively put it into the public domain with a grid reference, recorder and date (easy enough to do on any or all of the photo storage sites), and also submit it to iRecord as a record and to treesparrows.com, as well as announcing it on Facebook and Twitter and including it in, for example, 20 different Flickr groups and Wild about Britain, then it seems that I could have created around 30 duplicate records all floating around the internet for anyone interested in tree sparrows to use (and for them to duplicate a bit more, perhaps with omission or change of some information). I just wonder if it's a wee bit better to tackle this at the end user's point of contact with the use of some sensible SQL filtering, identifying what is probably duplicate and sorting it out prior to use for whatever nefarious activities people get up to with biological records?

It might be easier to accept that duplication of records will always occur and as users of that data, have procedures in place that remove duplication within datasets where practicable following an agreed set of rules for identifying what is a duplicate record. I just think it is now very easy to make sightings available to so many people through so many online resources that duplication of records is inevitable. I'm also not convinced that duplication of records is all that evil when it's relatively simple to set up procedures to ensure that the data used is relevant to the project it's being used for.
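As a rough illustration of this kind of filtering, the sketch below uses Python's built-in `sqlite3` module. The table layout, the sample rows and the matching rule (same taxon, grid reference, date and recorder after tidying case and spaces) are all invented for the example; a real agreed rule set would need to be worked out by the people using the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sightings "
    "(source TEXT, taxon TEXT, grid_ref TEXT, date TEXT, recorder TEXT)"
)

# Made-up rows: the same sighting arriving by different routes.
rows = [
    ("iRecord", "Passer montanus", "SD4578", "2013-05-04", "A. Recorder"),
    ("Flickr", "passer montanus", "SD 45 78", "2013-05-04", "a. recorder"),
    # Same sighting at a finer grid precision: NOT caught by the rule below.
    ("Twitter", "Passer montanus", "SD457781", "2013-05-04", "A. Recorder"),
    ("iRecord", "Passer domesticus", "SD4578", "2013-05-04", "A. Recorder"),
]
conn.executemany("INSERT INTO sightings VALUES (?, ?, ?, ?, ?)", rows)

# Group on tidied-up fields and report any group with more than one row.
query = """
SELECT LOWER(TRIM(taxon))                AS taxon_key,
       UPPER(REPLACE(grid_ref, ' ', '')) AS grid_key,
       date,
       LOWER(TRIM(recorder))             AS recorder_key,
       COUNT(*)                          AS copies
FROM sightings
GROUP BY LOWER(TRIM(taxon)),
         UPPER(REPLACE(grid_ref, ' ', '')),
         date,
         LOWER(TRIM(recorder))
HAVING COUNT(*) > 1
"""
probable_duplicates = conn.execute(query).fetchall()
```

With these sample rows the query reports one group of two probable duplicates (the iRecord and Flickr copies). The Twitter copy slips through because its grid reference is at a different precision, so the result is only as good as the matching rule.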

Iain

6

Re: Standards for uniquely identifying biological records

Duplication is annoying and if there are some simple ways of identifying duplicates I think they should be adopted. However, although true duplication is annoying it can, as you say, be picked up. The biggest problem is where records are altered in some way. Usually this is done so that the data will fit into the format of the importing system. For example, the species is changed, the grid reference altered in precision, a vague date is changed to something more specific, the recorder's or determiner's name is lost or interpreted, or abundance information interpreted in a different way. Without some reference back to the source these look like new records and there is not much hope of identifying them as duplicates. This isn't a problem as long as the records are not then circulated as though they are new data. For example, if data is imported from the Gateway into a system for reporting purposes, it should not be distributed as though it is new data. There is also a question of data ownership here. I would be extremely unhappy if any of the records I have submitted to the Gateway (via LRCs) ended up there duplicated as new records, especially if they had been changed in some way. If this were to happen too often you would find data providers reluctant to allow their data to be made generally available, which would not be in anyone's interest.
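A small sketch of why altered copies defeat field matching, again in Python with invented data. The `source_uuid` field is hypothetical and stands for "some reference back to the source"; the UUID value itself is just a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordCopy:
    source_uuid: str   # hypothetical reference back to the source record
    taxon: str
    grid_ref: str
    date: str
    recorder: str


def field_key(copy):
    """Comparison key built only from the visible fields."""
    return (
        copy.taxon.lower(),
        copy.grid_ref.replace(" ", "").upper(),
        copy.date,
        copy.recorder.lower(),
    )


original = RecordCopy(
    "123e4567-e89b-12d3-a456-426614174000",  # placeholder value
    "Passer montanus", "SD457781", "May 2013", "A. Recorder",
)
# The importing system coarsens the grid reference, turns the vague date
# into a specific one and reformats the recorder's name.
altered = RecordCopy(
    original.source_uuid,
    "Passer montanus", "SD47", "2013-05-01", "Recorder, A.",
)

same_by_fields = field_key(original) == field_key(altered)
same_by_identifier = original.source_uuid == altered.source_uuid
```

Here `same_by_fields` comes out false and `same_by_identifier` comes out true: once the fields have been changed, only the reference carried with the record still ties the two copies together.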

Mike Weideli

7

Re: Standards for uniquely identifying biological records

Hi Iain,

I understand where you are coming from here; we will never completely eliminate the problem of duplicate records for the sorts of reasons you mentioned. And I also agree that a certain amount can be done to eliminate duplicates at the point of use, but only a certain amount and there are all sorts of overheads and technical challenges involved in doing this. Therefore if we can minimise the need for it, so much the better.

But I also agree strongly with Mike's comments, especially the way that records can change (including determinations), making filtering at the point of use practically impossible in some cases. I don't believe that because it is a difficult problem to deal with at source, we shouldn't attempt to deal with it at all. A common standard for unique record identifiers and honouring those identifiers is not a technically difficult solution to implement - it's more a culturally problematic one. But by adopting and promoting *standards*, we can start to effect a gradual cultural change without actually imposing it on reluctant developers.

Since I wrote the original post, Paula Lightfoot sent me a couple of documents that show that these issues are starting to be addressed by the biodiversity informatics community. One is titled "NBN Standards for integrated online recording and verification" and the other "Briefing for technical meeting on online verification of records". The latter was produced by Steve Wilkinson of JNCC. (I'm not sure if either is available on the web.) Both these documents allude to the problem of unique identifiers amongst other things.

(One slightly puzzling aspect of them - to me - is that the issues of data exchange and dataflow are firmly couched in terms of *online* recording. But the problems are common to any systems that generate, manage and exchange biological records, whether record entry happens online or not.)

Rich

Richard Burkmar
Biodiversity Project Officer
Field Studies Council