Data Chunking in Salesforce for Large-Volume Data Extracts

Chunking can be done in two ways:

  1. Auto-number Chunking
  2. PK Chunking

Auto-number Chunking

Using a date range to group data into sets can be easy and clean, but the total number of records that fall within a given range might exceed the Force.com query optimizer’s selectivity threshold. When this happens, the underlying query must do a full object scan, often leading to timeouts. (Refer to the Database Query & Search Optimization Cheat Sheet to see what the selectivity thresholds are.) If you are certain that filtering by the date range alone will always be selective, you can use a date field to chunk your data. Otherwise, using a unique ID is more reliable, and can be implemented easily by following these steps.
  1. Create or use an existing auto-number field. Alternatively, you can use any number field (or combination of number fields) that makes up a unique, sequential value.
    Note: If you are adding a new auto-number field, make sure to enable the “Generate Auto Number for existing records…” option.
  2. Create a formula field that converts the auto-number field text value into a numeric value—you cannot use an index with comparison operators such as “<=” (less than or equal to) or “>” (greater than) for the text-based auto-number field. In this example, we’ll name this field “ExportID.”
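    For example, assuming the auto-number field is named “AutoNumber__c” and displays values in the format “EXP-{0000000}” (both the field name and format here are illustrative, not from the original post), the “ExportID__c” formula could be:
VALUE(MID(AutoNumber__c, 5, 7))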
  3. Place a custom index on the formula field by contacting salesforce.com Customer Support.
    Hint: Depending on the characteristics and distribution of your data, adding another selective filter, such as “CreatedDate,” and placing a compound (or two-column) index might help group the data sets in a more meaningful, manageable way.
  4. Run pre-queries to determine the boundaries. For example:
  • To find the lowest boundary:
Select ExportID__c From YourObject__c order by ExportID__c asc nulls last limit 1;
  • To find the highest boundary:
Select ExportID__c From YourObject__c order by ExportID__c desc nulls last limit 1;
Hint: Although you should have no records with the “ExportID__c” field empty (i.e., null), check for them before running your extraction so that no records are missed.
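For example, this query (using the same illustrative object name) should return zero:
Select count() From YourObject__c Where ExportID__c = null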
  5. Add a range filter to your extraction (i.e., a WHERE clause in your SOQL) to limit the number of targeted rows so that it stays below the selectivity threshold. For example:
SELECT Field1__c, Field2__c, [Field3__c …]
FROM YourObject__c
WHERE ExportID__c > 1000000
AND ExportID__c <= 1200000

When submitting the subsequent requests to retrieve the data, simply replace the “ExportID__c” values by incrementing the ranges until you have gone through all of your data. Things get a little more complicated if you mass delete records often. While the approach described above still works, the gaps created by deleting records reduce the number of records included in each range, making it less optimal from a performance perspective. In other words, you could have a “chunk” that returns only a few or even no rows. There is a solution to this problem, which I will cover in a later Technical Enablement post.
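
As a rough illustration, here is a minimal Python sketch of that loop using the simple-salesforce library (the object name, field names, chunk size, and credentials are all placeholder assumptions, not part of the original approach):

from simple_salesforce import Salesforce

# Placeholder credentials; substitute your own org details.
sf = Salesforce(username="user@example.com", password="...",
                security_token="...")

CHUNK_SIZE = 200000  # keep each range below the selectivity threshold

# Pre-queries to find the boundaries (check separately for records
# where ExportID__c is null, as noted above).
low = sf.query("SELECT ExportID__c FROM YourObject__c "
               "ORDER BY ExportID__c ASC NULLS LAST LIMIT 1")
high = sf.query("SELECT ExportID__c FROM YourObject__c "
                "ORDER BY ExportID__c DESC NULLS LAST LIMIT 1")
start = int(low["records"][0]["ExportID__c"]) - 1
stop = int(high["records"][0]["ExportID__c"])

all_records = []
while start < stop:
    end = start + CHUNK_SIZE
    soql = ("SELECT Field1__c, Field2__c FROM YourObject__c "
            "WHERE ExportID__c > {} AND ExportID__c <= {}".format(start, end))
    # query_all follows queryMore automatically within each chunk
    all_records.extend(sf.query_all(soql)["records"])
    start = end

print("Extracted {} records".format(len(all_records)))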

Ref: https://developer.salesforce.com/blogs/engineering/2013/06/extracting-large-data-volume-ldv-in-force-com.html

PK Chunking to the rescue!

Some of our larger enterprise customers have recently been using a strategy we call PK Chunking to handle large data set extracts. PK stands for Primary Key — the object’s record ID — which is always indexed. With this method, customers first query the target table to identify a number of chunks of records with sequential IDs. They then submit separate queries to extract the data in each chunk, and finally combine the results.
With the arrival of the Spring ’15 release, we’re pleased to announce that PK Chunking is a supported feature of the Salesforce Bulk API. Now you can get the performance benefits of PK Chunking without doing all the work of splitting the queries into manageable chunks. You can simply enter a few parameters on your Bulk API job, and the platform will automatically split the query into separate chunks, execute a query for each chunk and return the data. Here are the basics:
  • You can use PK Chunking with most standard objects and all custom objects.
  • To enable the feature, specify the header “Sforce-Enable-PKChunking” on the job request for your Bulk API query (see the sample request after this list).
  • By default, the Bulk API will split the query into chunks of 100,000 records. You can use the “chunkSize” header option to configure smaller chunks, or larger ones up to 250,000. Larger chunk sizes use fewer Bulk API batches but may not perform as well. For each object you are extracting, you might need to experiment a bit to determine the optimal chunk size.
  • You can perform filtering while using PK Chunking by simply including a WHERE clause in the Bulk API query. In this case, there may be fewer records returned for a chunk than the number you specified in “chunkSize”.
  • If an object is supported, you can also use PK Chunking to query the object’s sharing table. In this case, determining the chunks is more efficient if the boundaries are defined on the parent object’s record IDs rather than the share table’s record IDs. To take advantage of this, set the “parent” option of the header to the name of the parent object. For example, when querying OpportunityShare, set parent=Opportunity.
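
To make this concrete, here is a sketch of what the job-creation request might look like with PK chunking enabled (the API version, instance URL, and session ID below are placeholders):

POST /services/async/39.0/job HTTP/1.1
Host: yourInstance.salesforce.com
X-SFDC-Session: <sessionId>
Content-Type: application/xml; charset=UTF-8
Sforce-Enable-PKChunking: chunkSize=100000

<?xml version="1.0" encoding="UTF-8"?>
<jobInfo xmlns="http://www.force.com/2009/06/asyncapi/dataload">
  <operation>query</operation>
  <object>Account</object>
  <contentType>CSV</contentType>
</jobInfo>

The SOQL query itself is then added to the job as a batch, and the platform returns each chunk’s data as a separate batch result.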

Ref: https://developer.salesforce.com/blogs/engineering/2015/03/use-pk-chunking-extract-large-data-sets-salesforce.html

Examples

Suppose a customer is using the custom object MessageStatus__c to keep track of a high volume of phone calls, emails, and other communications. They want to perform a complete extract and limit the number of chunks to make consolidating the data easier. They can perform a Bulk API query on MessageStatus__c with this header:
Sforce-Enable-PKChunking: chunkSize=250000
Another customer is planning a security audit and wants to identify all the manual shares that exist on their Account records. To execute this, they can perform a bulk query on AccountShare, using the filter WHERE RowCause = 'Manual', with a header like this:
Sforce-Enable-PKChunking: chunkSize=250000; parent=Account
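The SOQL submitted as the job’s batch might then look like this (the field list is illustrative):
Select Id, AccountId, UserOrGroupId, AccountAccessLevel From AccountShare Where RowCause = 'Manual'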
