Wednesday, March 04, 2009

Improve Performance of ETL / SSIS Load Package using "Chunking" method


In a data warehouse environment, loading changes / deltas takes considerable time, especially during the initial load, when almost all records from the OLTP system are brought into the data warehouse for the very first population. In retail and CRM applications that operate round the clock and across the globe, the deltas also remain quite large. If the OLTP system is old and the number of records to be loaded runs into millions, load performance becomes an important concern.

Instead of executing a very large operation in a single statement call, you can break the modification into smaller pieces, potentially increasing performance and improving database concurrency for larger, frequently accessed tables. This technique is often used for large data loads to reporting or data warehouse applications.

Large, single-set updates can cause the database transaction log to grow considerably. When processing in chunks, each chunk is committed after completion, allowing SQL Server to potentially reuse that transaction log space. In addition to transaction log space, on a very large data update, if the query must be cancelled, you may have to wait a long time while the transaction rolls back. With smaller chunks, you can continue with your update more quickly. Also, chunking allows more concurrency against the modified table, allowing user queries to jump in, instead of waiting several minutes for a large modification to complete.

This "chunking" method can be used with INSERTs, DELETEs and UPDATEs too, along with SELECT. The expanded functionality of TOP (beyond just SELECT) adds a new technique for managing large data modifications against a table. By reducing the size of large modifications, you can improve database concurrency by reducing the time that locks are held during the operation (leaving small windows for other sessions), and also help manage the size of the transaction log (more commits, instead of one single commit for a gigantic transaction).

In SSIS, the OLE DB Destination's Maximum insert commit size and Rows per batch properties implement the above theory. The SQL CAT team recommends the following best practices for SSIS performance tuning and optimization (a T-SQL sketch of the same batched-commit idea follows the list):

  • Commit size 0 is fastest on heap bulk targets, because only one transaction is committed. If you cannot use 0, use the highest possible commit size to reduce the overhead of writing multiple batches. Commit size = 0 is a bad idea when inserting into a B-tree, because all incoming rows must be sorted at once into the target B-tree, and if memory is limited you are likely to spill. Batch size = 0 is ideal for inserting into a heap. For an indexed destination, testing batch sizes between 100,000 and 1,000,000 is recommended.
  • Use a commit size of less than 5000 to avoid lock escalation when inserting; note that in SQL Server 2008 you can now enable/disable lock escalation at the object level, but use this wisely.
  • Note that a commit size value of 0 might cause the running package to stop responding if the OLE DB destination and another data flow component are updating the same source table. To prevent the package from stopping, set the Maximum insert commit size option to 2147483647.
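The two properties above are set on the OLE DB Destination in the SSIS designer, so there is no T-SQL to show for them directly. As a hedged illustration of the same batched-commit trade-off, the sketch below uses BULK INSERT, whose BATCHSIZE option commits after every n rows instead of in one huge transaction (the file path and table name are hypothetical):

    -- Batched commits outside SSIS: BATCHSIZE = 100000 commits every
    -- 100,000 rows, mirroring the effect of Maximum insert commit size.
    BULK INSERT dbo.FactSales              -- hypothetical target table
    FROM 'D:\Loads\FactSales.csv'          -- hypothetical source file
    WITH
    (
        FIELDTERMINATOR = ',',
        ROWTERMINATOR   = '\n',
        BATCHSIZE       = 100000,
        TABLOCK                            -- bulk-load lock, as on a heap target
    );

A smaller BATCHSIZE keeps each committed transaction, and therefore the log and lock footprint, small at the cost of more commit overhead, which is exactly the trade-off the SQL CAT guidance above describes for commit size.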
