1

Re: A failsafe method to check for duplicate data entries - how?

Hi

I thought I was beginning to understand R6 and how it works, but my slight confidence has been somewhat shattered at a critical time (when I need to get sets of records - not in duplicate or worse - to people who will write species accounts for the next bird report).

The problem:

I have many, possibly now several thousand, duplicate records (ie same species, same date, same location, same abundance, same observer). This has occurred despite my use of a system whereby I put new files for import in one folder and, after import, move them to a different folder.

I know now that I get some duplicate records because observers send records to several public databases that I incorporate in my R6 database (records directly to me, or via BBS, Birdtrack, etc etc). Also, I have not found a foolproof method of checking whether records have been successfully imported or not. 

The other day I found that the number of records in R6 for the year 2007 was about 24,000, but from the number of records in my Excel import files I expected around 32,000. I was unable to account for the difference and wanted to check the records in R6 to see what was missing. I used the Report Wizard and my newly acquired skill of selecting the vice counties and selecting a year, but I have been unable to select 'Observer' from the list of variables in 'Additional filters'. I have used 'Surname (Determiner)' and found that it does not select all the records that a particular observer has had imported. For example, one person for whom I had imported 3,862 records had only 204 identified by the Report Wizard.

I have, some time ago, requested that the import process be amended so that R6 checks for duplicate records between the file to be imported and existing data (not just duplicates within the import file which is what it seems to do currently). As far as I am aware there has been no progress with that feature.

Question:

Is there a method, eg an XML report, or some other add-on, that can have parameters set so that it can check for duplicates and weed them out? The only means I know of currently is to select the set of records for each date and delete each record, one at a time.
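For reference, the sort of check I have in mind - grouping records on those five fields and flagging any group with more than one entry - could be sketched in Python against an exported file. The column names here are only assumptions, not the real R6 export headings, and a real run would read the rows from a CSV export rather than an in-memory sample:

```python
import csv
from collections import defaultdict

# Hypothetical column names; adjust to match the actual export headings.
KEY_FIELDS = ("Species", "Date", "Location", "Abundance", "Observer")

def find_duplicates(rows):
    """Group records on the duplicate key and return groups with >1 record."""
    groups = defaultdict(list)
    for row in rows:
        # Normalise case and whitespace so trivial differences don't hide dupes.
        key = tuple(row[f].strip().lower() for f in KEY_FIELDS)
        groups[key].append(row)
    return {k: v for k, v in groups.items() if len(v) > 1}

# Example with an in-memory sample instead of a real export file:
sample = [
    {"Species": "Barn Owl", "Date": "2007-05-01", "Location": "Rhydymwyn",
     "Abundance": "1", "Observer": "I. Spence"},
    {"Species": "Barn Owl", "Date": "2007-05-01", "Location": "Rhydymwyn",
     "Abundance": "1", "Observer": "I. Spence"},
    {"Species": "Dipper", "Date": "2007-05-01", "Location": "Rhydymwyn",
     "Abundance": "2", "Observer": "I. Spence"},
]
dups = find_duplicates(sample)
```

With a real export one would replace `sample` with `list(csv.DictReader(open("export.csv")))` and review each flagged group before deleting anything.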

How do colleagues with much bigger sets of records than I have check to ensure they do not have duplicate records? How do you get rid of real duplicates?

I would be grateful for any suggestions!

Cheers, Ian

2

Re: A failsafe method to check for duplicate data entries - how?

Hi

I have been thinking further about how to check my database. One ideal way would be to see the number of records in the database by the 'Observer' - but I cannot find a way of doing that.

I do not put an entry in the 'Determiner' field in the import spreadsheet (though I may have to mend my ways about this), so if I do a search for all records in a year with 'Obs Determiner' selected, I now know that the report returns fewer records than are in the database. Ideally, I wish to check the number of records submitted by each observer.
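If the records could be exported, the per-observer tally I am after would be trivial to produce outside R6. A minimal sketch (the 'Observer' field name is an assumption, not a real export heading):

```python
from collections import Counter

# Sketch: tally records per observer from an exported list of records,
# to compare against the counts in the original import spreadsheets.
records = [
    {"Observer": "A. Birder", "Species": "Wren"},
    {"Observer": "A. Birder", "Species": "Robin"},
    {"Observer": "B. Watcher", "Species": "Wren"},
]
per_observer = Counter(r["Observer"] for r in records)

for observer, count in per_observer.most_common():
    print(observer, count)
```

Any observer whose total here falls short of the total in their import file would point to records that failed to import.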

I thought of creating a Rucksack, as Charles had suggested for taxa recently, but I cannot find a place in the Report Wizard for opening a rucksack with names of observers in it.
With over 200 observers in a year, selecting each name at the start of the Wizard is not a viable option. The situation would be much easier if all the fields available for import were available for selection and viewing in the Report Wizard.

One way to show all the records received in a year would be to leave 'Obs Determiner' out of the reporting, but then I would have no indication of the observer for any of the records that show, so it would not be helpful in trying to establish duplicate or missing records.

Does anybody know a way around this problem? Any suggestions welcome!

Thanks in anticipation, Ian

3

Re: A failsafe method to check for duplicate data entries - how?

Hi

I have done some further checking - 'Sample Observer' in the Report Wizard does not include any extra records. I know there are extra records that should be included in the year's sample, as I have found that one site, Rhydymwyn, is excluded from any data selection despite having several thousand records against it. I don't understand why this site is not included when nearby sites, eg Rhydymwyn,N and Rhydymwyn,E, are included.

My current record total for the year is 23,621 but that is probably at least 4,000 short of the actual total.

When selecting variables to show in the Report Wizard I have chosen those that I know have been used. This means that many records have blank spaces (as expected - not all are first-winter records), but I do not understand why the Wizard is omitting at least one place. I don't know how many other places it is similarly omitting, and I know of no means of auditing what is present in the database.

I am really struggling with this now. Suggestions, please!

Cheers, Ian

4

Re: A failsafe method to check for duplicate data entries - how?

Hi Ian

I would suggest you have a word with Luck and Eric at rECOrd (01244 383749). They have a routine (originally written by one of rECOrd's Trustees, Vernon Hockley) which will run through the database identifying duplicate data.

The duplicate data issue (and the fact that the same data often arrives from different sources) is one which rECOrd faced a long time ago. It was one of the main reasons for writing our online data entry system (which allows online keying and/or the importation of spreadsheets), which checks data for duplication and formatting errors etc. as it is entered or imported.

Hope you find some help at rECOrd - tell them I sent you.

Cheers

Steve :)

Steve J. McWilliam
www.rECOrd-LRC.co.uk
www.stevemcwilliam.co.uk/guitar/

5

Re: A failsafe method to check for duplicate data entries - how?

The next version of Recorder (due any day now) includes a report that can identify duplicate data.

Charles Roper
Digital Development Manager | Field Studies Council
http://www.field-studies-council.org | https://twitter.com/charlesroper | https://twitter.com/fsc_digital

6

Re: A failsafe method to check for duplicate data entries - how?

Hi Steve

Thanks for your suggestion - I shall follow that up later today as the weather is so horrible I cannot go checking Barn Owl nests as planned. I hope retirement is suiting you well!

Hi Charles

I hope the new version does come out soon, but in another thread I read that you have a system for checking import files for duplicate records (ie with those already on your database). Is this system an add-on or integral to your computer system so cannot be shared?

I did manage to find why R6 was not selecting records from a particular place, last night. I checked the Location details and in the Administrative Areas tab the information about counties had been removed (not by me!). I then started to check all my other Locations and have found that some sites have had their county details removed and others have had them duplicated. I have been adding the details to those where they have been removed and editting the extra details. This is worrying, as R6 has done this on its own - with what frequency, under what conditions, etc? Who knows? Has this happened to other users? So, when I did a check of the database without requesting VC information my total of records for 2007 shot up from 23,621 to 30,513 - so there are probably several Locations that have had their county details removed.

All the best, Ian

7

Re: A failsafe method to check for duplicate data entries - how?

Ian, that does sound like a rather worrying bug. Hopefully Steve (or Sarah's successor) will pick it up and investigate, but I'd send them an email, just to be sure.

I don't have anything special to test for duplicates on import. Recorder simply checks to see if a key already exists when it imports data and then allows you to reconcile this. It doesn't check the actual data, though, it just checks the keys. If you're importing from a spreadsheet and that spreadsheet doesn't contain keys, then it can't check for duplicates.
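To illustrate the difference between key-based and content-based checking: a content-based check would have to fingerprint the substantive fields themselves, so two records with different database keys but identical data still collide. A minimal sketch, with purely illustrative field names (nothing here is real Recorder internals):

```python
import hashlib

def content_key(record,
                fields=("species", "date", "location", "abundance", "observer")):
    """Hash the substantive fields so records can be compared even when
    their database keys differ. Field names are illustrative only."""
    joined = "|".join(str(record.get(f, "")).strip().lower() for f in fields)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

# Fingerprints of records already in the database (one shown here):
existing = {content_key({"species": "Wren", "date": "2007-06-01",
                         "location": "Wrexham", "abundance": "1",
                         "observer": "I. Spence"})}

# An incoming record with a different (or no) key but identical content:
incoming = {"species": "Wren", "date": "2007-06-01", "location": "Wrexham",
            "abundance": "1", "observer": "I. Spence"}
is_duplicate = content_key(incoming) in existing
```

This is the kind of check that catches the "same record from two sources" case that a key comparison necessarily misses.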

Charles Roper
Digital Development Manager | Field Studies Council
http://www.field-studies-council.org | https://twitter.com/charlesroper | https://twitter.com/fsc_digital

8

Re: A failsafe method to check for duplicate data entries - how?

Hi Charles

Do you know how imminent the new version of R6 will be - days or months?

To whom should I send an email about the places problem? I have not noticed an announcement about a replacement for Sarah. When I went through all my locations today, I found that possibly as many as 20+ locations had had their Administrative Area data removed. A similar number had their Admin Area info duplicated, and in some cases triplicated. One location had Admin Area data for another place added to its set (with one of the counties different - hence I could tell it was not just a duplication).

All the best, Ian

9

Re: A failsafe method to check for duplicate data entries - how?

Charles/Ian
I am keeping an eye on this. I would need a bit more to go on, though, to be able to replicate it - have you any idea how or when it is occurring? For example, were the data against the rogue locations being reimported or deleted, or something similar?
Steve

10

Re: A failsafe method to check for duplicate data entries - how?

Hi Steve

Unfortunately I have no idea what may have caused the problem with the locations. About three months ago I completely reorganised my location hierarchy so that I could import tetrad data, but that just meant moving the location information from a logical position, eg 'Wrexham', to a grid-ref-based system, with 10 km squares, then tetrads, then 1 km squares within tetrads. At that time I had no reason to suspect that any Admin Area information would be corrupted, so did not check. I realised there was a problem when preparing to send records to species account authors and found that one place was not selected by Vice County.

I have imported much data and at no time was there any different behaviour with the Import Wizard. Also, now that I have corrected the locations, I think most, if not all, of the imported data are accounted for. I'm sorry that I can't think of anything that may have precipitated the errors in the Admin Areas. Have you heard of this happening to other users? Have any other users checked to see if they have such a problem? Is there a routine available in R6 to export to Excel (or similar) the locations and all the Admin Areas assigned to them? Such a routine could be run from time to time as a check.
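Given such an export, the audit itself would be simple: flag every location with no Admin Areas at all, and every location listing the same area more than once. A sketch (the table layout and column names are assumptions, since no real export exists yet):

```python
# Audit an exported table of locations and their administrative areas.
# Structure and names below are illustrative, not a real R6 export.
locations = [
    {"Location": "Rhydymwyn",   "AdminAreas": []},
    {"Location": "Rhydymwyn,N", "AdminAreas": ["Flintshire"]},
    {"Location": "Wrexham",     "AdminAreas": ["Denbighshire", "Denbighshire"]},
]

# Locations whose Admin Area details have been removed entirely:
missing = [loc["Location"] for loc in locations if not loc["AdminAreas"]]

# Locations where the same Admin Area appears more than once:
duplicated = [loc["Location"] for loc in locations
              if len(loc["AdminAreas"]) != len(set(loc["AdminAreas"]))]
```

Run periodically against a fresh export, this would catch both the removed and the duplicated county details before they silently skew report totals.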

All the best, Ian

11

Re: A failsafe method to check for duplicate data entries - how?

Ian,

I haven't checked the administrative areas section, but I do know that merging locations caused duplication in the grid squares section of the Geo Info tab, which may be the same problem as your duplications. This was mostly done in R2002, so I am not sure if it is still a problem in 6. I don't know about the missing areas, though.

Gordon

Gordon Barker
Biological Survey Data Manager
National Trust

12

Re: A failsafe method to check for duplicate data entries - how?

Hi Gordon

That's interesting - a pity that I did not look after I had merged my locations. However, at least one site that lost its Admin Areas has never been merged with any site.

Cheers, Ian