Optimal configurations for bulk import for a batch of 100000 to 1000000 documents #432

Sandy247 opened this issue Dec 31, 2020 · 1 comment


@Sandy247

Could anyone help me with the optimal configurations for the connector when inserting documents from dataframes of sizes ranging from 100,000 to 1,000,000? The Spark cluster can autoscale to 20 worker nodes, and the Cosmos DB database can autoscale to about 4,000 RUs (shared throughput between containers, although the container this data goes to is the hottest). Also, what throughput setting for Cosmos and node count for Spark would you recommend if I wanted to insert 1,000,000 records in around 5-10 minutes? I appreciate any help on this. Thanks.
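For context, the write path I'm tuning looks roughly like this (a PySpark sketch, assuming the azure-cosmosdb-spark connector; the endpoint, key, and names are placeholders, and option keys such as WritingBatchSize should be checked against the connector's configuration reference for the version in use):

```python
# df is the dataframe of 100,000 to 1,000,000 documents mentioned above.
# Placeholder endpoint/key/database/container values; option keys assumed
# from the connector's configuration reference.
write_config = {
    "Endpoint": "https://<account>.documents.azure.com:443/",
    "Masterkey": "<key>",
    "Database": "<database>",
    "Collection": "<container>",
    "Upsert": "true",
    # Per-batch write size is one of the knobs I am unsure how to tune.
    "WritingBatchSize": "1000",
}

(df.write
   .format("com.microsoft.azure.cosmosdb.spark")
   .mode("append")
   .options(**write_config)
   .save())
```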

@anuthereaper

What is the size of each document? Is this a one-time process or a daily process?
Before you start, you should probably read this article if you haven't already: https://docs.microsoft.com/en-us/azure/cosmos-db/optimize-cost-reads-writes#optimizing-writes
A few things that are important:

  1. Size of each document
  2. The choice of partition key
  3. Indexing
1,000,000 documents in roughly 10 minutes comes to about 1,667 docs per second. I will assume each doc is <1 KB. The Microsoft docs tell us that it takes around 5.5 RUs to write a 1 KB doc with no indexing. If you need to query the data later you will probably want indexing on, but set it only for specific paths; by default Cosmos indexes every property. Even at 5.5 RUs per write you will need 1,667 × 5.5, which is around 9K RU/s (see the sketch below). If your docs are more than 1 KB in size and/or you are using indexing, you will need more than this.
When adjusting the RU/s I would suggest you keep checking the Metrics blade to see if you are hitting the peaks, and adjust accordingly.
Assuming this is a one-time or batch process, bring the RU/s back down afterwards to avoid extra costs.
The speed of ingestion also depends on how many physical partitions you have; if you set the RU/s to more than 10K, Cosmos will automatically start to split the partitions.
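To make that arithmetic concrete, here is a minimal sketch (plain Python; the 5.5 RU-per-write figure is the assumption from the docs above, not a measurement of your workload):

```python
# Rough RU/s sizing for the bulk import discussed above.
# Assumption: docs are ~1 KB and cost ~5.5 RUs per write with trimmed-down
# indexing; larger docs or full indexing will cost more per write.

def required_ru_per_second(doc_count: int, minutes: float, ru_per_write: float) -> float:
    """Sustained RU/s needed to write doc_count documents within the window."""
    writes_per_second = doc_count / (minutes * 60)
    return writes_per_second * ru_per_write

if __name__ == "__main__":
    for minutes in (5, 10):
        ru = required_ru_per_second(1_000_000, minutes, ru_per_write=5.5)
        print(f"{minutes} min window: ~{ru:,.0f} RU/s sustained")
    # 10 min window: ~1,667 writes/s x 5.5 RU ≈ 9,167 RU/s (roughly 9K RU/s),
    # well above the 4,000 RU/s autoscale ceiling mentioned in the question.
```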
