1

Re: Importing and duplicates

Hi

My systems have not been as good as I had hoped (or I have not used them properly) and I have imported a couple of data files twice. This has not been too serious so far, as the files have not been huge, but it raises a question in my mind. What are the duplicates that the import process refers to? I cannot remember it ever flagging any items as duplicates, though I have had invalid data items flagged.

I understood that duplicate items would be data that are the same in the R6 database and in the to-be-imported file (e.g. species, date, place, observer, count, etc.). If all those are the same, the new data are duplicates. I thought I had better check the Help files (though these are for R2000) to see what they say, and I don't understand what they are telling me. There is a red message: "Detection of duplicates RELIES ENTIRELY ON THE GLOBALLY UNIQUE KEYS, the data are not checked at all". This seems a staggering way to set up a system (unless there is something important that I am missing here). How does R6 allocate globally unique keys if it pays no regard to the data they refer to? And if each record in each to-be-imported file is given a unique key, how could any duplicates ever be found? To find them it is surely essential to check the actual data.

At face value it seems that R6 does not detect duplicate data entries. Am I missing something, or is there something I should do to try to avoid such duplicates in the future?

I'm not sure if this is entirely related, but I have found that after import there is more than one reference to a location and date for the same data set. I have used the Database Merge tool to combine them. My problem, above, is that I end up with two identical data sets for species, location, date, count, etc. I do not wish to merge them; I wish to ensure the second copy is not imported. Does anyone know of a good way of checking the database for duplicate values so they can be marked for individual deletion or, better still, deleted as a block?

Even the superb Falkland Islands instructions don't help with this issue!

Any ideas please?

Thanks, Ian

2

Re: Importing and duplicates

Hi Ian

Duplicate records, as far as the import process is concerned, are explained as follows:
1) Each record, on entry, is allocated a unique NBN key. This key includes the site identifier, so it never clashes with another record in the same table.
2) If you export this record to other datasets via zipped Access or NBN data formats, the other dataset will also hold a copy of this same record.
3) If the record is imported into a dataset which already holds a copy of a record with the same key then it is identified as a duplicate. This could occur if you export the record to someone and they export it back to you, for example (see the sketch below).
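
In rough Python terms, the check amounts to something like this (the record layout and key format here are only illustrative, not Recorder's actual schema):

    # Sketch of key-based duplicate detection on import. Only the
    # key is compared; the record's data (species, date, place,
    # count, etc.) is never inspected. Field names are invented.
    def find_key_duplicates(incoming_records, existing_keys):
        duplicates = []
        for record in incoming_records:
            if record["nbn_key"] in existing_keys:
                duplicates.append(record)
        return duplicates

    existing = {"ABC1234500000001"}
    incoming = [
        # Same key (e.g. exported and re-imported): flagged.
        {"nbn_key": "ABC1234500000001", "taxon": "Turdus merula"},
        # Same observation entered afresh under a new key: not flagged.
        {"nbn_key": "XYZ9876500000001", "taxon": "Turdus merula"},
    ]
    print(find_key_duplicates(incoming, existing))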

Recorder does not identify duplicates on import for two reasons. Firstly, it's not clear exactly what constitutes a duplicate record in terms of the data: John Smith could record a blackbird in Poole on a given day, and if there are two records with this information, how do you determine whether the second is genuine or a duplicate? Secondly, scanning an entire import dataset for this sort of duplication would be quite intensive and would slow down imports significantly.

An option would be for an XML report to be written that lists, in order, potential duplicate records based on a simple rule (any occurrences of the same taxon at the same location on the same day, for example). It wouldn't be perfect but it might help.
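
As a very rough sketch of that rule, in Python (field names invented for illustration, and not how the XML report itself would be written):

    from collections import defaultdict

    # Group occurrences by (taxon, location, date) and list any
    # group with more than one record as a potential duplicate.
    def potential_duplicates(occurrences):
        groups = defaultdict(list)
        for occ in occurrences:
            key = (occ["taxon"], occ["location"], occ["date"])
            groups[key].append(occ)
        # Sorted so potential duplicates are listed together, in order.
        return {k: v for k, v in sorted(groups.items()) if len(v) > 1}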

Kind Regards

John van Breda
Biodiverse IT

3

Re: Importing and duplicates

If I've understood John's reply, then Ian is correct in saying that (apart from the GUID) there are no duplicate data controls in Recorder?

Maybe I have misunderstood, but if the import has duplicate data the question of which record is genuine does not arise: if they are identical, either record would be valid, but not both. The question is whether the duplicate data is contiguous in the input file. If so, it's easy to compare record N to record N+1.

I thought the import data was transferred into a temp mdb file. Would this not be the place where duplicate data could be identified? It would not require pre-scanning of input files. The import process is quite slow anyway, so I'd be happy to slow it down a bit more to make sure duplication rules were applied.
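
For illustration: sorting the staged rows on the comparison fields makes duplicates contiguous even when the input file isn't, so the N to N+1 comparison becomes trivial. A rough Python sketch, with made-up field names:

    # Sort staged import rows on the fields that define a duplicate,
    # then compare each record to its neighbour. Field names are
    # illustrative, not the temp database's real columns.
    def adjacent_duplicates(rows, fields=("taxon", "location", "date", "count")):
        key = lambda r: tuple(r[f] for f in fields)
        ordered = sorted(rows, key=key)
        return [b for a, b in zip(ordered, ordered[1:]) if key(a) == key(b)]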

Regards,

Dave Cope,
Biodiversity Technology Officer,
Biodiversity Information Service for Powys and Brecon Beacons National Park.

4

Re: Importing and duplicates

Dave, maybe I have it wrong, but I think the question refers to the import of a record that is already in the database rather than two identical records in the same import.

In the second case you could do your own checking within the file to be imported; although sequential ordering could not be assumed, it should be possible to rearrange the data to do this, though that is potentially a large job. In the first case it would be an immense job to check the records in the importing file against the Recorder database.

A tool like the one John mentions would presumably find duplicates that had arrived by either of these routes, or any other possible route, and would be extremely useful, as I am sure our dataset contains duplicate records but I do not know where they are or how many there are. I can see us running it regularly to ensure no duplication has crept in. Should this be put in as a feature request?

Gordon

Gordon Barker
Biological Survey Data Manager
National Trust

5

Re: Importing and duplicates

Hi John, Dave and Gordon

I like the idea of the XML report that John mentions, to identify any duplicates that have slipped in. I would prefer some way of identifying duplicates before the import is confirmed, but I can see that as the dataset in the database gets bigger the amount of checking could become huge. On the other hand, with all the dictionary information, I would have thought the search could be restricted to those items contained within the checklist, or dictionary, specified by the data to be imported.
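
Something along these lines, perhaps (a rough sketch with invented field names; I don't know how Recorder holds this internally):

    # Restrict the duplicate search to taxa on the checklist used
    # by the import, so only a subset of the database is compared.
    def candidates_for_checking(existing_records, import_checklist_taxa):
        taxa = set(import_checklist_taxa)
        return [r for r in existing_records if r["taxon"] in taxa]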

If these are feature requests, then please put them on the list!

All the best, Ian

6

Re: Importing and duplicates

Gordon Barker wrote:

Dave, maybe I have it wrong, but I think the question refers to the import of a record that is already in the database rather than two identical records in the same import.

In the second case you could do your own checking within the file to be imported; although sequential ordering could not be assumed, it should be possible to rearrange the data to do this, though that is potentially a large job. In the first case it would be an immense job to check the records in the importing file against the Recorder database.

Hi Gordon,

In the case of duplicate data in the input, I agree that in most cases it can be seen and removed beforehand (using software tools), and this is what I check for with new data; a sketch follows below. Some formats make that simpler, e.g. flat text files, spreadsheets, etc. If it is in NBN zipped format, the work is a little harder as one has to open it up and join the tables to examine for duplicates.
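
For a flat file, the pre-import check can be as simple as this rough sketch (assuming one occurrence per row; the exact columns will vary):

    import csv

    # Remove exact duplicate rows from a flat text file before
    # import. The column layout is an assumption.
    def dedupe_csv(in_path, out_path):
        seen = set()
        with open(in_path, newline="") as src, \
             open(out_path, "w", newline="") as dst:
            writer = csv.writer(dst)
            for row in csv.reader(src):
                key = tuple(row)
                if key not in seen:
                    seen.add(key)
                    writer.writerow(row)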

Even if duplicates are removed from the input, we could still have records in that data which are already in Recorder.
During the import process the import data is put into a temp DB. Unless I've got this all wrong, at this point, given the underlying Recorder data model, the next step is to copy that data into the model. It is at this stage that I would argue the data could be checked for duplication.
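
Again as a rough sketch (table and field names invented; I don't know the temp DB's real layout):

    # Check staged import data against the live database before it
    # is copied into the model: build the set of comparison tuples
    # already present, then drop staged rows that match.
    def filter_staged(staged_rows, existing_rows,
                      fields=("taxon", "location", "date", "count")):
        key = lambda r: tuple(r[f] for f in fields)
        existing = {key(r) for r in existing_rows}
        return [r for r in staged_rows if key(r) not in existing]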

A tool like the one John mentions would presumably find duplicates that had arrived by either of these routes, or any other possible route, and would be extremely useful, as I am sure our dataset contains duplicate records but I do not know where they are or how many there are. I can see us running it regularly to ensure no duplication has crept in. Should this be put in as a feature request?

Yes, a very useful tool indeed. But rather than a report, it would be of more use if it actually took action to rectify the situation. Also, if it can find duplicate data that already exists, it could also find it before new data is appended. In a post-import process, what should be done with duplicates?

For two identical records that exist in separate surveys the question is: which is the correct survey for that record? [1] Once a choice is made to keep the data in a particular survey, should the duplicate data be removed and dumped out to a separate DB (i.e. NBN zipped)?

Cheers.

[1] I've assumed that the model, for the example here, starts at the survey event level.

Dave Cope,
Biodiversity Technology Officer,
Biodiversity Information Service for Powys and Brecon Beacons National Park.

7

Re: Importing and duplicates

Hi

Just to let you know I have made a note of the requests above.

At the moment we are focussed on testing the new installation CDs (version 6.10), but once these are released we plan to review the existing functionality within Recorder 6 and user requirements (particularly those of users collating and managing data) over the next few months (and beyond).

I'll continue to monitor this thread and take note of any further suggestions.

Many thanks for your input,

Sarah

Sarah Shaw
Biodiversity Information Assistant
JNCC

8

Re: Importing and duplicates

Hi

I'm working through my duplicate data, deleting those that are lower down the list on the screen (so presumably entered later). When I press the delete key to delete samples, I frequently get errors, as below:


Exception occurred in application Recorder 6 at 29/06/2007 08:49:36.
Version : 6.10.4.120

Exception path:
EAccessViolation : Access violation at address 00565FC6 in module 'RecorderApp.exe'. Read of address 00000061

Last event\actions:
  TfrmSampleDetails created
  TfrmSampleDetails destroyed
  TfrmEventDetails created
  TfrmEventDetails destroyed
  TfrmTaxonOccurrences created
  TfrmTaxonOccurrences destroyed
  TfrmEventDetails created
  TfrmEventDetails destroyed
  TfrmTaxonOccurrences created

Operating System : Windows XP  5.01.2600  Service Pack 2
Physical Memory available : 2,095,196 KB

DLLs loaded:
  advapi32.dll (5.1.2600.2180)
  comctl32.dll (5.82.2900.2982)
  comdlg32.dll (6.0.2900.2180)
  gdi32.dll (5.1.2600.3099)
  HHCtrl.ocx (5.2.3790.2847)
  kernel32.dll (5.1.2600.3119)
  mpr.dll (5.1.2600.2180)
  MS5.Dll (5.0.0.12)
  MS5User.Dll (5.0.0.4)
  odbc32.dll (3.525.1117.0)
  ole32.dll (5.1.2600.2726)
  oleaut32.dll (5.1.2600.2180)
  shell32.dll (6.0.2900.3051)
  user32.dll (5.1.2600.3099)
  version.dll (5.1.2600.2180)
  winmm.dll (5.1.2600.2180)
  winspool.drv (5.1.2600.2180)

Information has been saved to the file C:\Program Files\Recorder 6\LastError.txt


I don't understand all that is said here, but I wonder why R6 produces errors when I try to delete. Is this okay, or is there a glitch that needs to be sorted?

Cheers, Ian

9

Re: Importing and duplicates

Hi Ian

An unhandled error does occur (not every time but quite frequently) when deleting items such as a sample or taxon occurrence in the Observation Hierarchy.

This is a known bug that we're planning to fix in the next version.

Best wishes,

Sarah

Sarah Shaw
Biodiversity Information Assistant
JNCC