Not as simple a task as one might imagine (as I'm sure you have worked out). The initial selection would be easy enough, as you could simply query the sample table for items with nothing in the LOCATION_KEY field (in the SQL Server tools).
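As a rough sketch of that initial selection, using an in-memory SQLite database as a stand-in for SQL Server (the SAMPLE table layout and column names other than LOCATION_KEY are my own assumptions, not the real Recorder schema):

```python
import sqlite3

# In-memory stand-in for the SQL Server database; the table layout is
# an assumption, built around the LOCATION_KEY column mentioned above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE SAMPLE (
        SAMPLE_KEY   TEXT PRIMARY KEY,
        SPATIAL_REF  TEXT,
        LOCATION_KEY TEXT           -- NULL when no location is attached
    )
""")
conn.executemany(
    "INSERT INTO SAMPLE VALUES (?, ?, ?)",
    [("S1", "SU1234", "LOC1"),
     ("S2", "SU5678", None),        # no location: candidate for processing
     ("S3", "ST0011", None)],
)

# The initial selection: samples with nothing in LOCATION_KEY.
unallocated = conn.execute(
    "SELECT SAMPLE_KEY, SPATIAL_REF FROM SAMPLE WHERE LOCATION_KEY IS NULL"
).fetchall()
print(unallocated)  # → [('S2', 'SU5678'), ('S3', 'ST0011')]
```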
I would have thought, though, that the harder task would be to check whether the sample spatial ref falls entirely within a polygon. That is a potentially very long piece of GIS data crunching: first create a layer of all the records, with square polygons reflecting their spatial ref precision; then perform an "entirely within" query; then use the results of that query to populate the sample LOCATION_KEY, in the SQL Server tools again. What GIS system do you use?
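To illustrate the idea behind the "entirely within" test, here is a pure-Python sketch, assuming the grid ref has already been resolved to an easting/northing and a precision in metres. The site polygon and coordinates are made up, and a real GIS would of course do all of this properly, including concave boundaries:

```python
# Sketch of the "entirely within" test for a grid-ref square against a
# site polygon. Coordinates and the site boundary below are hypothetical.

def grid_square(easting, northing, precision):
    """Corners of the square implied by a grid ref of the given precision."""
    return [(easting, northing),
            (easting + precision, northing),
            (easting + precision, northing + precision),
            (easting, northing + precision)]

def point_in_polygon(x, y, polygon):
    """Standard ray-casting point-in-polygon test."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            if x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
                inside = not inside
    return inside

def square_entirely_within(easting, northing, precision, site):
    # For a convex site boundary, all four corners inside implies the
    # whole square is inside; concave boundaries also need an
    # edge-intersection check, which a GIS would handle for you.
    return all(point_in_polygon(x, y, site)
               for x, y in grid_square(easting, northing, precision))

# Hypothetical convex site polygon (easting, northing pairs).
site = [(400000, 150000), (404000, 150000), (404000, 154000), (400000, 154000)]

print(square_entirely_within(401000, 151000, 100, site))   # 100 m ref well inside: True
print(square_entirely_within(399900, 151000, 1000, site))  # 1 km ref straddling the edge: False
```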
I would have to ask what you hope to gain by it, and whether you think it is really worthwhile. Might it not create a false sense of the reliability of the data? On our system, for example, we have hundreds of thousands of tetrad records which would not be attributable to any individual site (except in rare circumstances), even though many of them were recorded on accessible areas of species-rich land which we recognise as sites in our GIS layers. So if we wanted to query all of the records for a given site and be sure of getting them all, we would still need to query all records which overlap, and are thus potentially within, the site. The situation is further complicated when site boundaries are edited from time to time: if a site is extended, you would then have to requery and add any records which fall within the new boundary.
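The reason the coarser "overlaps" query is needed can be sketched in the same style: a 2 km tetrad square can overlap a site without falling entirely within it. Here I use axis-aligned bounding boxes for simplicity (a common first-pass filter); the coordinates are hypothetical, and a real GIS would test against the actual site boundary:

```python
# Why coarse records need an "overlaps" query: a tetrad-scale square may
# only partially cover a site. Bounding boxes here are a simplification.

def boxes_overlap(a, b):
    """a and b are (min_e, min_n, max_e, max_n) bounding boxes."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

site_bbox = (400000, 150000, 404000, 154000)   # hypothetical site extent

tetrad = (399000, 149000, 401000, 151000)      # 2 km square on the site's corner
print(boxes_overlap(tetrad, site_bbox))        # True: overlaps, so potentially within
```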
I have given this a lot of thought over the years, and always end up deciding that the only time we will associate an occurrence with a location is when we can be sure that the recorder was familiar with the currently recognised boundary of the site and explicitly stated that the occurrence was within it. Consider the situation where a recorder makes a minor error in the grid reference, giving a value which is inside a site rather than just outside. The spatial error is relatively inconsequential, moving the record by a few metres perhaps, but the post-processing you propose turns it into a larger conceptual error by positively identifying the record as falling within the site when in fact it does not.
I suppose, given that you work for NT, it is possible that your dataset only contains data which were recorded within your sites, which changes things somewhat.
Not sure if that helps you all that much though.
Rob Large
Wildlife Sites Officer
Wiltshire & Swindon Biological Records Centre