Optimal configurations for bulk import for a batch of 100000 to 1000000 documents #432

Sandy247 opened this issue Dec 31, 2020 · 1 comment


@Sandy247

Could anyone help me with the optimal configurations for the connector when inserting documents from dataframes of sizes ranging from 100,000 to 1,000,000? The Spark cluster can autoscale to 20 worker nodes, and the Cosmos DB database can autoscale to about 4,000 RUs (shared throughput between containers, although the container this data goes to is the hottest). Also, what throughput setting for Cosmos and node count for Spark would you recommend if I wanted to insert 1,000,000 records in around 5-10 minutes? I appreciate any help on this. Thanks.
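For context, the write path I'm tuning looks roughly like this (a PySpark sketch, assuming the azure-cosmosdb-spark connector; the endpoint, key, and names are placeholders, and option keys such as WritingBatchSize should be checked against the connector's configuration reference for the version in use):

```python
# df is the dataframe of 100,000 to 1,000,000 documents mentioned above.
# Placeholder endpoint/key/database/container values; option keys assumed
# from the connector's configuration reference.
write_config = {
    "Endpoint": "https://<account>.documents.azure.com:443/",
    "Masterkey": "<key>",
    "Database": "<database>",
    "Collection": "<container>",
    "Upsert": "true",
    # Per-batch write size is one of the knobs I am unsure how to tune.
    "WritingBatchSize": "1000",
}

(df.write
   .format("com.microsoft.azure.cosmosdb.spark")
   .mode("append")
   .options(**write_config)
   .save())
```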

@anuthereaper

What is the size of each document? Is this a one-time process or a daily process?
Before you start, you should probably read this article if you haven't already: https://docs.microsoft.com/en-us/azure/cosmos-db/optimize-cost-reads-writes#optimizing-writes
A few things that are important:

  1. Size of each document
  2. The choice of partition key
  3. Indexing
1,000,000 documents in roughly 10 minutes comes to about 1,667 docs per second. I will assume each doc is <1 KB. The Microsoft docs tell us that it takes around 5.5 RUs to write a 1 KB doc with no indexing. If you need to query the data later you will probably want indexing on, but set it only for specific paths; by default Cosmos indexes every property. Even at 5.5 RUs per write you will need 1,667 × 5.5, which is around 9K RU/s (see the sketch below). If your docs are more than 1 KB in size and/or you are using indexing, you will need more than this.
When adjusting the RU/s I would suggest you keep checking the Metrics blade to see if you are hitting the peaks, and adjust accordingly.
Assuming this is a one-time or batch process, bring the RU/s back down afterwards to avoid extra costs.
The speed of ingestion also depends on how many physical partitions you have; if you set the RU/s to more than 10K, Cosmos will automatically start to split the partitions.
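To make that arithmetic concrete, here is a minimal sketch (plain Python; the 5.5 RU-per-write figure is the assumption from the docs above, not a measurement of your workload):

```python
# Rough RU/s sizing for the bulk import discussed above.
# Assumption: docs are ~1 KB and cost ~5.5 RUs per write with trimmed-down
# indexing; larger docs or full indexing will cost more per write.

def required_ru_per_second(doc_count: int, minutes: float, ru_per_write: float) -> float:
    """Sustained RU/s needed to write doc_count documents within the window."""
    writes_per_second = doc_count / (minutes * 60)
    return writes_per_second * ru_per_write

if __name__ == "__main__":
    for minutes in (5, 10):
        ru = required_ru_per_second(1_000_000, minutes, ru_per_write=5.5)
        print(f"{minutes} min window: ~{ru:,.0f} RU/s sustained")
    # 10 min window: ~1,667 writes/s x 5.5 RU ≈ 9,167 RU/s (roughly 9K RU/s),
    # well above the 4,000 RU/s autoscale ceiling mentioned in the question.
```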
