Topic: SQL to mark up duplicate records


It is NBN upload time again. This is the time when the number of duplicate records in our database really becomes apparent. Not just identical duplicates either but the same record with different recorders and grid reference precisions upon closer inspection i.e.

Soprano Roost, SD 123 456, 05/07/2015, ABC123 Consultancy, "Roost in west face of building under gutter"
Soprano Roost, SD 1239 4561, July 2015, John Smith, "Roost in west face of building under gutter - roost retained"

In the above case we would like to retain the second record and mark the first as a duplicate so it doesn't keep reappearing in reports. We want to retain it rather than delete it for several reasons including

* keeping the first dataset intact
* so we don't get the deleted record returned to us
* if a data exchange agreement terminates and we have to delete the dataset containing the second record we still have access to the first version obtained via a different source

Does anybody have an example query which would do the following?

i.e. UPDATE occurrence_determination SET determination = "duplicate record", determination comment = "<insert original occurence id>" WHERE taxon_occurrence.id = "<insert duplicate occurence id>"

Thanks for any ideas in advance. Although I'm fairly seasoned in SQL the recorder6 data model scares me a little.


Re: SQL to mark up duplicate records

What is difficult is working out where best to hold the information that the record is a duplicate. The usual way to do is to use the Determination type, and this is fine, and allows you to set the Verified indicator to  '1' preventing export of the records or appearing in reports. This is by far the easiest method and works fine  if  you don't ever expect to use the record again.  However, if you envisage a time when you might wish to restore the record then  this method will result in the loss of the original determination type.

If you don't use Surveyors ref then this field could be used.

If you do use it  then you are left with the option of adding something to the record which can be deleted if the record is needed again. The obvious candidate for this is a Taxon_Occurrence Measurement.  Marking records manually with a Measurement  is fairly simple, but automating the process - say from a list of duplicates - requires a SQL stores procedure query as new Taxon_Occurrence_Data keys would need to be created.

Another option is to create  a new preferred determination with a determination type of 'Duplicate'. If you ever need to use the record again then the original determination can be made the preferred once again.  As with the Measurement this change requires the creation of new keys - in this case for a new Taxon_Determination.

Let me know what you think about these ideas and I will give it some further thought.

Mike Weideli

3 (edited by charliebarnes 16-03-2016 16:31:42)

Re: SQL to mark up duplicate records

I've done something similar which uses a concatenation of taxon-1km grid ref-date to identify duplicates, but that wouldn't pick up your example!

I basically create a temp table with the taxon-1km grid ref-date combo and select all records that match that combo (then use the set of designation batch updates to add a designation of e.g. duplicate).

Charlie Barnes
Information Officer
Greater Lincolnshire Nature Partnership


Re: SQL to mark up duplicate records

Hi Mike

Further to this subject I've run H1 and got 2000 or so duplicates in my database. Having had a look at a wide sample of these I'm pretty certain that they are true duplicates and would like to mark the duplicates as such with a new preferred determination with a determination type of 'Duplicate'. How could this be done - a batch update? And would this prevent duplicate records appearing in output reports?

Suppose I decided later that I wanted to delete the duplicates en masse. Could this be done by selecting all records with determination type "Duplicate" and running a batch update to delete them?

Bob Marsh
Doncaster LRC
Coleoptera recorder for YNU


Re: SQL to mark up duplicate records

H1 can generate a csv file, which could be used to create a new determination, It wouldn't take that long, to do the required Btach update, but I am struggling a bit with time at the moment so will not immediately offer to help. The Determination type of duplicate needs to have a Verified indicator of 1 which will stop the data appearing in reports. The  Batch update needs to bring the verified indicator on the occurrence into line with the Determination type and make sure that the new determination is made the preferred. Deleting the determination needs to reverse this by both deleting the determination, allocating a new preferred and bringing the taxon occurrence verified  indicator in line with this.  Allocating the preferred  may not be so easy if there are several determinatiosn as it will not know which one to use (could just make the latest on the preferred if this happens).

Mike Weideli