1

Topic: Problem importing dataset

I am having a strange problem trying to import a large dataset:

* The dataset is a TAB delimited text file containing 83,446 records
* on importing the data, after loading the data, Recorder reports (in bottom left corner) 41250 records
* I tried reducing the dataset to exactly 40,000 records and Recorder reported c34000 so it isn't a size issue
* on checking the data in the import window, there are random batches of records missing (there is a unique incrementing ID, so easy to see the missing batches, e.g. record nos. 103-125 are missing, I've examined these records in the original text file with an editor which can display hidden characters and can see nothing strange with the records, and the break is in the middle of a batch for a sample
* Trying to import the first 200 records works fine, as do first 1000, first 10000, and first 20000, but importing first 30000 only imports 29999, randomly missing record number 19611 and as previously mentions the first 40000 only imports 34569

Using version 6.24.0.280 (Dict 3B) on Windows 7

Any ideas, I can send the file, but obviously it is very large (didn't want to put it on the forum!)

Craig

Craig Slawson
Staffordshire Ecological Record

2

Re: Problem importing dataset

Have you tried converting it to CSV using e.g. Excel and importing that? (Or even just converting it to XLS?). I find Excels CSV output the most reliable.

Is it a MapMate TAB delimited file?

Charlie Barnes
Information Officer
Greater Lincolnshire Nature Partnership

3

Re: Problem importing dataset

Can't convert to XLS - too many rows, tried XLSX and get an immediate Recorder crash on trying to import the data, but I don't think this is caused by the data, I find it almost impossible to import large spreadsheets since the last update of Recorder, and have posted this to the forum previously

CSV is a problem, because the comments field contains quotes, double quotes and commas, so can't delimit the fields properly

The file originates from an export from a native Access table where I processed the data (after almost totally corrupting it in Excel!)

Craig Slawson
Staffordshire Ecological Record

4

Re: Problem importing dataset

Have you tried splitting the TAB file in half and using XLS?

Happy to have a look at if you want and try importing it here - if you ZIP it up it should email okay: charlie.barnes@glnp.org.uk

Charlie Barnes
Information Officer
Greater Lincolnshire Nature Partnership

5

Re: Problem importing dataset

If someone has a large problem file  which they can send me on a memory stick or upload to dropbox I will take a look.  The import was tested on some very large files and in all formats,  but there could be combinations of things which cause problems which no amount of testing will find.

Mike Weideli

6

Re: Problem importing dataset

Mike,

Zipped, it is less than 1mb, but I don't want to publish it to a public forum, so I'll e-mail it to you direct

Craig Slawson
Staffordshire Ecological Record

7

Re: Problem importing dataset

There is a corruption in the file provided, which is causing the problem.

The following method show that there is a corruption and this may be of use in similar situations.

To see that there are problems import the original data. Click on 'First Row Contains Column Name'. Then double click on  ID (second  column, second row). This will sort the entries by ID and you will see invalid data. Similarly sorting on other columns can show invalid data (eg species name).

Mike Weideli

8

Re: Problem importing dataset

Thanks Mike,

I have located the problem - in the original Access table there were carriage returns in the comments field, a 'copy' out of the Access table instead of 'Export' showed these up clearly - after removing the carriage returns, the data imports in full, but weird, that the error was so much more serious than the original problem, there were less than a dozen records with carriage returns, but it stopped over 40,000 records going in!

Cheers
Craig

Craig Slawson
Staffordshire Ecological Record

9

Re: Problem importing dataset

With extra carriage returns the temporary import file will contain rows which do not have all the columns populated with some that are poulated having corrupt data. The import wizard grid appears to be populated in blocks using Sql which makes assumptions about the contents of the columns, so if the sql fails because some of the records in the block are invalid then the  data is not displayed for this block. What I am unsure of is how the process  is identifying the records for each block. Clearly  not based on a numeric sort. I expect it would be possible by analysing the missing records or by looking in the R6 code to identify exactly what is happening, but  interesting though this would be, it would only be worth doing as part of a modification  to improve the process.

Mike Weideli

10

Re: Problem importing dataset

Mike,

I've done a little further checking of the original data, and the Access table was definitely corrupted, there weren't any carriage returns, but there were some unprintable characters which were messing everything up - a copy translated them to carriage returns, but an Export didn't!

Craig

Craig Slawson
Staffordshire Ecological Record