Sunday, March 25, 2012

Basic advice on good practice required

I have master tables that I will be updating from our ERP system. Some examples I have seen take the approach of dropping a table in SQL Server and then recreating it before importing; some, and probably my choice, append and update; I have not seen an example where all records are deleted and the data appended afterwards. Of the three approaches, which is generally regarded as best practice / most efficient?

The drop-and-recreate approach and the delete-then-append approach are really the same thing from a process perspective, but they will perform differently because of the overhead of data changes in SQL Server. DELETE is a fully logged operation, so it incurs overhead by writing to the transaction log for each row that is deleted. Dropping the table is just much faster. Truncating the table is similar to the drop; it is a minimally logged, non-selective delete in effect.
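To make the difference concrete, here is a minimal T-SQL sketch; dbo.Customer is a made-up table name.

-- Fully logged: every deleted row is written to the transaction log.
DELETE FROM dbo.Customer;

-- Minimally logged: only the page deallocations are recorded, so it is
-- much faster on a large table, but it cannot be selective.
TRUNCATE TABLE dbo.Customer;

-- Faster still, but indexes, permissions and constraints go with the
-- table and have to be recreated before the reload.
DROP TABLE dbo.Customer;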

As to which of the two methods is faster, the full re-load or the partial load and update, it depends on lots of factors. The cost of the extract will be constant, since both methods require you to extract all the data, so it comes down to the cost of inserting lots of data (lots of write activity) versus the cost of the comparisons needed to reduce those writes to only the changes required. I cannot give a definitive answer, as things like overall row width, the comparison columns and how much data needs to be moved around for each method will all affect this. Here is an article that looks at two methods of comparing tables.

Get all from Table A that isn't in Table B
(http://www.sqlis.com/311.aspx)
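As a flavour of that comparison, here is a minimal T-SQL sketch; the staging.Customer and dbo.Customer names are illustrative.

-- Rows in the source (Table A) that do not yet exist in the target (Table B).
SELECT s.CustomerID, s.Name, s.City
FROM   staging.Customer AS s
LEFT JOIN dbo.Customer AS c
       ON c.CustomerID = s.CustomerID
WHERE  c.CustomerID IS NULL;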

Another option is to just insert all the data and capture the failures caused by primary key violations, then check the failed rows for updates. I did some tests of a similar approach for patching bad data, and it was generally quicker to let the insert fail and fix afterwards than to check, and potentially fix, all rows up front. This will depend on the ratio of good to bad rows, or in your case of new rows, unchanged rows and rows that need to be updated.
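In a data flow you would typically get this by redirecting the destination's error output into an update; a rough per-row T-SQL sketch of the same pattern, again with made-up names, would be:

DECLARE @CustomerID int = 1,
        @Name nvarchar(50) = N'Anne',
        @City nvarchar(50) = N'Leeds';

BEGIN TRY
    -- Optimistically insert; this fails if the key already exists.
    INSERT INTO dbo.Customer (CustomerID, Name, City)
    VALUES (@CustomerID, @Name, @City);
END TRY
BEGIN CATCH
    -- Error 2627 is a primary key violation, so the row must already
    -- exist: fix it with an update instead.
    IF ERROR_NUMBER() = 2627
        UPDATE dbo.Customer
        SET    Name = @Name, City = @City
        WHERE  CustomerID = @CustomerID;
    ELSE
        THROW;  -- anything else is a genuine error (THROW needs SQL Server 2012+)
END CATCH;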

If the cost of updates is high, then you will want to check whether an update is really required or whether nothing has changed. The Checksum Transform can be useful for this, as comparing one integer is faster than moving and comparing 200 columns of data to detect a change. Again, it depends on your data.

Checksum Transformation
(http://www.sqlis.com/21.aspx)
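The Checksum Transform is a pipeline component, but the same idea works in plain T-SQL with CHECKSUM; note that CHECKSUM can collide, so consider HASHBYTES where a false match would matter. Names are again illustrative.

-- Only update rows whose checksum differs; comparing one integer per
-- row is cheaper than comparing every column.
UPDATE c
SET    c.Name = s.Name,
       c.City = s.City
FROM   dbo.Customer AS c
INNER JOIN staging.Customer AS s
        ON s.CustomerID = c.CustomerID
WHERE  CHECKSUM(c.Name, c.City) <> CHECKSUM(s.Name, s.City);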

The thing you need to work out is where the bottleneck in your process is. It may be that the extract query or the network transfer is the slowest part, in which case it does not matter which method you use for the downstream load.

My opinion is to always go for deltas if you can, but if not, then delete (drop/truncate) and reload everything. It can be harder with foreign keys, but I think it is going to be faster.
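On the foreign key point, TRUNCATE is not allowed on a table that is referenced by a foreign key, so a full reload means dropping the constraint and putting it back afterwards; a sketch with made-up names:

ALTER TABLE dbo.Orders DROP CONSTRAINT FK_Orders_Customer;

TRUNCATE TABLE dbo.Customer;
-- ...reload dbo.Customer from the ERP extract here...

-- WITH CHECK re-validates the reloaded data against the orders.
ALTER TABLE dbo.Orders WITH CHECK
    ADD CONSTRAINT FK_Orders_Customer
    FOREIGN KEY (CustomerID) REFERENCES dbo.Customer (CustomerID);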

sql
