1

Topic: Unique ID

Hi

Just curious if Indicia has an inbuilt sort of mechanism for 'tagging' records with some sort of unique identifier, maybe a GUID based thing? I presume something tending towards unique can be derived from occurrence data and a warehouse ID if such a creature exists?

Any ideas?

Ta

Iain

2

Re: Unique ID

Certainly it has been discussed, for example with thoughts of managing data that might be brought together from multiple warehouses.
If I use the Utilities > Easy Download form there is a column labelled RecordKey in the exported file.
Check out the report used for generating the download and you will see

<column name='recordkey' display='RecordKey' sql="'iBRC' || co.id" datatype="text" />

So, 'iBRC' is being prepended to the occurrence id to generate a GUID of sorts, provided that each warehouse modifies this report and adds its own unique id. Obviously far from foolproof at the moment.
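
As a rough illustration of what that report column does, the key is simply a warehouse-specific prefix joined to the occurrence id. This is only a sketch: the function name and the second prefix ('iXYZ') are made up for the example, and only 'iBRC' comes from the report itself.

```python
def record_key(prefix: str, occurrence_id: int) -> str:
    """Mimic the report's sql="'iBRC' || co.id": prefix + occurrence id."""
    if not prefix:
        raise ValueError("each warehouse needs its own non-empty prefix")
    return f"{prefix}{occurrence_id}"


# 'iBRC' is the prefix used in the report; 'iXYZ' is a hypothetical
# prefix for a second warehouse that happens to reuse the same id.
print(record_key("iBRC", 1234))  # iBRC1234
print(record_key("iXYZ", 1234))  # iXYZ1234
```

The keys only stay distinct while no two warehouses pick the same prefix, which is the weakness noted above.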

Jim Bacon

3

Re: Unique ID

Agreed - this report's record prefix should be parameterised.

I always considered the unique id of a record to be the warehouse URL + the occurrence ID.

John van Breda
Biodiverse IT

4

Re: Unique ID

Thanks for the responses - could the occurrence 'External key' be a home for such a beast?

5

Re: Unique ID

Hi Iain,

Interesting. A couple of thoughts.

The GUID is only needed when data is exported so there is perhaps no need to store it in the Warehouse.

However, if one warehouse were aggregating records from several others, a GUID would need storing and the Occurrence External Key is where it would naturally go.

As you may know, the External Key is a field where we store a GUID if importing records from another database, e.g. MapMate. This allows ongoing synchronisation with the originating database.

It probably defeats the purpose of a GUID to give a record a new one so, if a record is imported from MapMate, the MapMate GUID should be propagated when exporting the record. The current export query fails to do this.

Jim Bacon.

6

Re: Unique ID

Could you help with something that I do not quite understand? I was quite surprised to read on this thread that a unique record key was not an inbuilt requirement within Indicia, if only for database referential integrity. I had assumed that it would be there and in the Recorder format.

In the latest NBN eNews there are a number of very interesting developments regarding iRecord, which I believe is based upon Indicia. http://nbn.org.uk/News/Latest-news/New- … ecord.aspx
They say that there is now a download facility for LRCs. It is in NBN Exchange Format and "The ‘top copy’ of the data remains in the iRecord database and the record ID remains with the dataset." So this implies that there is already a unique key for each record in Indicia/iRecord?

Mike Beard
Natural Course Project Officer
Greater Manchester Local Records Centre

7

Re: Unique ID

Hi Mike

Internally, an Indicia Warehouse has a unique key for every record which is a sequential number. The thing is that there may be a number of Indicia Warehouses created which would all be giving records the same internal numbers - they are not Globally Unique Identifiers (GUIDs).

The way this is currently being managed is that, at export, records from the BRC warehouse, which iRecord is using, are being given a record key of "iBRC" plus the internal record id. Other warehouses would need to prepend a different code for them to be globally unique and there is still a risk that two warehouses might be set up with the same prefix.

As John says above, a reliable solution is to use the warehouse url as the prefix. This would be very easily accomplished. Just another aspect of the on-going development.
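
A minimal sketch of the URL-as-prefix idea, assuming an identifier of the form warehouse URL + "/occurrence/" + internal id. The path segment and both warehouse URLs are placeholders chosen for the example, not anything Indicia defines.

```python
def global_record_id(warehouse_url: str, occurrence_id: int) -> str:
    """Combine the warehouse URL with the internal sequential id."""
    base = warehouse_url.rstrip("/")  # tolerate a trailing slash
    return f"{base}/occurrence/{occurrence_id}"


# Two hypothetical warehouses that both hold an occurrence numbered 42.
id_a = global_record_id("https://warehouse-a.example/", 42)
id_b = global_record_id("https://warehouse-b.example", 42)

print(id_a)          # https://warehouse-a.example/occurrence/42
print(id_b)          # https://warehouse-b.example/occurrence/42
print(id_a != id_b)  # True
```

Because a URL is already unique to one installation, the risk of two warehouses choosing the same short prefix goes away.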

Jim Bacon.

8

Re: Unique ID

Hi

In practical terms (and a bit of thinking out loud) I can see that if a MapMate record finds its way into a warehouse and has a persistent identifier (from now on PID), that record would be considered a copy. So, if someone exports from MapMate to Excel excluding the PID and then passes this on for inclusion in a dataset, there may be duplication of records by that route, which could perhaps be resolved through user training. Equally, I might be given an Excel worksheet without knowing whether it has come from MapMate or elsewhere. I'm beginning to see that the problem is how people get from individually recording something to contributing that something into a global arena with a persistent identifier.

How things are recorded by an individual (just thinking electronically here; this is a generalisation to reflect a multitude of different ways):

1) by phone (direct to warehouse)
2) online recording form (direct to warehouse)
3) Excel or other non-dedicated software (not obvious how to create PID)
4) local installation of dedicated recording software (MapMate, Recorder etc - can generate PID)
5) PDA/data capture in the field (more likely as part of systematic survey rather than personal, but still generates useful records)

Probably others as well, but that's a few to get me thinking.

For 1, 2 and 4 it should be relatively easy to generate a PID - the problem with 4 is that the value of this information might not be understood and it might get lost before it has a chance to find its way into the system.

For 5, with the appropriate expert knowledge, a system could be in place to generate a PID on the fly - but it's possible (and perhaps more likely) that this ends up in a spreadsheet.

So, my question - I receive a spreadsheet with no PID which is in effect the master copy. It's relatively simple to plonk this into Indicia of course, but then do I generate the PID, attach it to the spreadsheet, send it back for the users' reference and tell them to enter records online from now on (that'll work...)? Or is it better to accept that the PID starting point is when the record actually enters the world of online biodiversity informatics? (I guess it would be fair to assume that the reason for passing the record on in the first place was to allow wider accessibility to the records.)

Sorry for the ramblings... Hope some of the above makes sense

Ta

Iain

9

Re: Unique ID

Hi Daniel

The PDF from the following link makes interesting reading, apologies if you've already seen it

http://imsgbif.gbif.org/CMS_ORC/?doc_id … download=1

I was thinking the external key in occurrences might well be the place to actually store a persistent ID

Iain

10

Re: Unique ID

Hi Iain,

I did not know about the PDF, I will read it soon.

The reason why we can't store the uuid in the external_id field is that we use this field for our definition of external_id:

We are going to accumulate several smaller databases in our warehouse, each of which might have its own concept of what the primary id is. So we store a prefix + their primary id. With this entry we have a link to the original record.

We want to attach our own id, which is a uuid, to every record. This is the primary key we give out if we share our data with others. Right now we have an additional attribute on sample and occurrence. But storing a uuid as a string is rather inefficient since it is just 128 bits. Also, queries against the additional attribute tables need extra joins. Since we display the uuids with every record, this is also relevant for us.
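
To put rough numbers on the storage point, here is a small sketch using Python's standard `uuid` module. It only compares the sizes of the different representations and says nothing about how Indicia itself stores the value.

```python
import uuid

record_uuid = uuid.uuid4()      # random (version 4) UUID

as_text = str(record_uuid)      # canonical form with hyphens
as_hex = record_uuid.hex        # hexadecimal digits only
as_bytes = record_uuid.bytes    # raw 128-bit value

print(len(as_text), len(as_hex), len(as_bytes))  # 36 32 16

# The compact 16-byte form converts back to the same UUID.
assert uuid.UUID(bytes=as_bytes) == record_uuid
```

A string column therefore carries at least twice as many bytes as the raw value. A database with a native 128-bit uuid column type (PostgreSQL has one) avoids that overhead.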

We came to use uuid and not "site prefix" + occurrences.id because one site to which we have to export data uses the uuid as its natural primary key, so it was the easiest.

Daniel

11

Re: Unique ID

Hi

I noticed the other day that I couldn't see a response to Iain's post (currently number 8) which I thought I had written. Now I cannot see the post from Daniel to which Iain is replying in the current post number 9. Am I delusional or does anyone else feel that something has gone missing?

Exactly as Daniel describes, I expect to have recorders working in MapMate and syncing with Indicia and the occurrence external_id is where I would be keeping the MapMate id.

Jim Bacon.

12

Re: Unique ID

Hi again,

I have been pondering some more and realise that I have two questions on my mind that are causing me to reconsider my previous post. I haven't read the document that Iain has pointed out, so apologies if what I am about to write is ill-informed.

My two questions are
1. Can I assume that records I receive in to Indicia from an external database will arrive with a UUID?
2. If I pass on records from Indicia that came from an external database should I do so with their original UUID or a new one?

Related to this is the question of whether the sync is private and controllable or public and not controllable.

If I receive records with a key that is not guaranteed unique I have to store the originating key to permit future synchronisation but add a UUID for onward sync. The external key is the field where I expect to store the originating key. To avoid duplicates you would prefer the source not to sync with any other system on the same 'network'.

If I receive records with a UUID then, if I use that rather than adding my own, then the originator may sync with various organisations in the 'network' without duplication and it is far more easily traceable. If this key were stored in the external key field I would then need a flag to indicate it is unique. Alternatively I would need another field for it. Either way, at point of sync, something or someone has to tell Indicia that the incoming keys are unique.
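
The two cases above could be sketched roughly as follows. The function name and the `origin_key_is_unique` flag are hypothetical; the flag stands in for the "something or someone" that tells Indicia whether the incoming keys are globally unique.

```python
import uuid


def keys_for_incoming(origin_key: str, origin_key_is_unique: bool):
    """Return (external_key, onward_id) for a record received by sync.

    external_key: the originator's key, always kept for future sync.
    onward_id:    the identifier to pass on to other systems.
    """
    if origin_key_is_unique:
        # Case 2: reuse the originator's key so the record stays traceable.
        return origin_key, origin_key
    # Case 1: the key is only unique at source, so mint a UUID for onward sync.
    return origin_key, str(uuid.uuid4())


external_key, onward_id = keys_for_incoming("rec-0001", origin_key_is_unique=True)
print(external_key == onward_id)  # True

external_key, onward_id = keys_for_incoming("42", origin_key_is_unique=False)
print(external_key, len(onward_id))  # 42 36
```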

MapMate records come with a key that is unique amongst all MapMate users so I would rather like to pass it on as the UUID if that were possible.

Jim Bacon.

13

Re: Unique ID

Jim Bacon wrote:

1. Can I assume that records I receive in to Indicia from an external database will arrive with a UUID?

No you can't, since not all databases have this kind of identifier. But we expect databases that want to share records with us to have one. When we migrate a database that does not have UUIDs into ours, we will assign them.

Jim Bacon wrote:

2. If I pass on records from Indicia that came from an external database should I do so with their original UUID or a new one?

You should keep the original one and use it. One uuid identifies one record, even if the record is duplicated over more than one database.

Jim Bacon wrote:

If I receive records with a key that is not guaranteed unique I have to store the originating key to permit future synchronisation but add a UUID for onward sync. The external key is the field where I expect to store the originating key. To avoid duplicates you would prefer the source not to sync with any other system on the same 'network'.

Yes.

Jim Bacon wrote:

If I receive records with a UUID then, if I use that rather than adding my own, then the originator may sync with various organisations in the 'network' without duplication and it is far more easily traceable. If this key were stored in the external key field I would then need a flag to indicate it is unique. Alternatively I would need another field for it. Either way, at point of sync, something or someone has to tell Indicia that the incoming keys are unique.

I would like to use the external_key field for the primary key of the old database, to be able to track the history of a record. The uuid is then the "new" id of the record, which is used to identify it. I could also use something like "indicia.flora-mv.de/sample/123456" but the uuid is designed to be a unique identifier. It's much smaller (128 bits) and well defined.

Jim Bacon wrote:

MapMate records come with a key that is unique amongst all MapMate users so I would rather like to pass it on as the UUID if that were possible.

How are these MapMate keys built? I could not find anything about them on the net.

Daniel

14

Re: Unique ID

Hi Jim
If we have a UUID field for occurrences then we can:
a) populate it on creation of a new record. For this to be a valid UUID, the algorithm used for generation should more or less guarantee it is globally unique (hence a UUID).
b) when we enable UUIDs on the Indicia warehouse, update all existing records with one.
c) allow setup of synching between Indicia warehouses and other "warehouses" which support generation of a UUID, such as MapMate or Recorder. In these cases, the UUID field is populated by the UUID provided by the other system. If a record is received with a UUID matching a record in our database, then we update rather than insert.
d) We probably need to store the "source" system so that a UUID generated by one does not clash with a UUID generated by another (though in the case at hand, a 128bit UUID will never clash with a 16 char Recorder key or a MapMate key so that might not be necessary).

If we follow these rules then we can allow free flow of records without fear of duplication. It gets more complex if we are trying to allow warehouses without their own UUID support to sync. I think this is a use case we can safely just not support.
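
Rules (a), (c) and (d) can be sketched with a toy in-memory store. Everything here is illustrative: the `Warehouse` class, its method names and the sample key "abc123" are invented for the example and are not part of Indicia.

```python
import uuid


class Warehouse:
    """Toy in-memory stand-in for a warehouse's occurrences table."""

    def __init__(self, name):
        self.name = name
        self.occurrences = {}  # (source system, key) -> record data

    def create(self, data):
        # Rule (a): a locally created record gets a fresh UUID.
        key = (self.name, str(uuid.uuid4()))
        self.occurrences[key] = dict(data)
        return key

    def receive(self, source, record_key, data):
        # Rules (c) and (d): keep the sender's key and remember its source.
        key = (source, record_key)
        action = "update" if key in self.occurrences else "insert"
        self.occurrences[key] = dict(data)
        return action


wh = Warehouse("warehouse-a")
# "abc123" is a made-up key standing in for one supplied by the other system.
print(wh.receive("mapmate", "abc123", {"taxon": "Bombus terrestris", "count": 1}))  # insert
print(wh.receive("mapmate", "abc123", {"taxon": "Bombus terrestris", "count": 2}))  # update
print(len(wh.occurrences))  # 1
```

Sending the same record twice leaves a single row, which is the "update rather than insert" behaviour described in (c).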

Regards
John

John van Breda
Biodiverse IT

15

Re: Unique ID

Hi

If we are only going to support synchronisation with systems that come with a unique ID and if we are going to forward that unique ID then we could just use the external key field. It is defined as character varying (50) so could accommodate keys from Recorder (16 characters), MapMate (10 characters in the records table, see http://www.mapmate.co.uk/downloads/DataModel2.zip) or a 128-bit UUID stored as a 32 character hexadecimal value. And Daniel is already using this field for the originating key of the databases from which he is importing.
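
A quick check of those lengths against the 50 character limit, as a sketch only. The Recorder and MapMate lengths are the figures quoted above; the UUID lengths are computed with Python's standard `uuid` module.

```python
import uuid

EXTERNAL_KEY_MAX = 50  # character varying (50), as quoted above

key_lengths = {
    "Recorder key": 16,                            # figure quoted above
    "MapMate key": 10,                             # figure quoted above
    "UUID as hex": len(uuid.uuid4().hex),          # 32
    "UUID with hyphens": len(str(uuid.uuid4())),   # 36
}

for label, length in key_lengths.items():
    fits = length <= EXTERNAL_KEY_MAX
    print(f"{label}: {length} chars, fits={fits}")
```

Even the hyphenated text form of a UUID (36 characters) fits within the field.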

So, I'm now thinking, Daniel, that your need for an additional UUID field exists because you do not want to use the original ID of your records, contrary to the idea of keeping an original ID. And you want to use a UUID because that makes your onward synchronisation easier.

A new thought to me but obvious to anyone who has been through this before. To perform cross-system synchronisation you either need to use the same type of ID at both ends or you need to be able to store one another's UUIDs. By way of example, I could upload MapMate records to Indicia, edit them on either Indicia or MapMate and sync the changes. However, if I create a new record in Indicia, with an Indicia ID, I can't get it into MapMate unless MapMate is modified to work with an Indicia ID (whatever it is) or Indicia is able to attach MapMate type IDs to its records (which might not be such a bad idea, not that I have any idea whether the people at MapMate would support that at all).

Taking that a step further, you could have, perhaps at the user level, perhaps at the survey level, an option to set your external key format so that Indicia creates unique IDs in a format that matches the system you want to sync with. That might ease the burden on other systems we would like to inter-operate with.

Jim Bacon

16 (edited by iain 02-07-2013 12:01:52)

Re: Unique ID

Hi

There do appear to be a couple of posts missing in the middle of this trail... I had seen them both, but neither is showing up now for whatever reason.

A lot of really useful information in this thread, and with a wee bit more digging on the web I've come up with these links which seem to suggest some sort of globally recognised definition for such a key (at least until something better comes along). I would suggest that something like this would go into the external_key field, with a new field created for the recognised key of an existing system (MapMate, Recorder... and so on). Looking at the example given, there may be a solution in inserting the unique occurrence key from an external system as a catalog number directly into the UUID. Mind you, I'm not sure how the various institution codes are agreed upon, and I can't immediately see how anyone could claim this as unique unless there was an agreed standard for all of this, which may well be the sticking point...

http://rs.tdwg.org/dwc/terms/index.htm#occurrenceID
http://terms.gbif.org/wiki/dwc:occurrenceID

Iain

17

Re: Unique ID

Hi

The forum administrators have been told about the missing messages.

I note, Iain, that the Life Science Identifier (LSID) in Darwin Core is specified "in the absence of a bona fide global unique identifier" and that GUID is what Daniel is proposing to add.

If there are to be two fields, one for original ID and one for GUID, we just need to agree which one goes in the field currently called external_key. Existing precedents are for original ID to go in the external_key. It would seem to me that two fields are only necessary if we think that records from other databases that supply records do not have a GUID.

If there is a concern that the IDs from MapMate, Recorder and Daniel's source databases cannot be considered globally unique or don't adequately identify the original record then an LSID seems to me a more transparent way of identifying them than creating a 128-bit UUID. Perhaps that is something that one configures when establishing a synchronisation process, allowing Daniel to have his UUID if that is what is best for him.
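
For comparison with an opaque UUID, an LSID has the general form urn:lsid:authority:namespace:object. The sketch below assembles one; the authority, namespace and object values are placeholders, and how a real warehouse would choose them is exactly the open question here.

```python
def make_lsid(authority: str, namespace: str, object_id: str) -> str:
    """Build an LSID of the form urn:lsid:<authority>:<namespace>:<object>."""
    parts = (authority, namespace, object_id)
    if any(not part or ":" in part for part in parts):
        raise ValueError("LSID parts must be non-empty and contain no colons")
    return "urn:lsid:" + ":".join(parts)


# Placeholder values: a made-up authority domain and an internal occurrence id.
print(make_lsid("warehouse.example.org", "occurrence", "1234"))
# urn:lsid:warehouse.example.org:occurrence:1234
```

Unlike a UUID, the result shows at a glance which system issued the identifier and which of its records is meant.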

As far as agreeing institution codes is concerned, I see that GBIF has a register of organisations (see, for example, the NBN entry). There is also a Registry of Biological Repositories.

This has me thinking about the longevity of records. An individual using MapMate could sync records with Indicia which might go on to the NBN which might go on to GBIF and have an LSID that refers back to the original MapMate database. (I keep using MapMate as my example as it is currently popular but it could be anything.) In 20 years' time, where will that recorder and their MapMate database be? Is it better for the LSID to refer back to an institutional database that might be longer lived? (Although who can be sure of that when NERC is considering privatising CEH.)


Jim.