Could anyone help me with the optimal configuration for the connector when inserting documents from DataFrames ranging in size from 100,000 to 1,000,000 rows? The Spark cluster can autoscale to 20 worker nodes, and the Cosmos DB database can autoscale to about 4,000 RU/s (throughput is shared between containers, although the container this data goes to is the hottest). Also, what throughput setting for Cosmos DB and node count for Spark would you recommend if I want to insert 1,000,000 records in around 5-10 minutes? I appreciate any help on this. Thanks.
1,000,000 docs in 10 minutes comes to roughly 1,667 docs per second. I will assume each doc is under 1 KB. The Microsoft docs tell us that it takes around 5.5 RU to write a 1 KB doc with no indexing. If you need to fetch the data later you will probably want indexing on, but set it only for the specific properties you query; by default, Cosmos DB indexes every property. Even at 5.5 RU per write, you will need 1,667 × 5.5 RU/s, which is around 9K RU/s. If your docs are larger than 1 KB and/or you are using indexing, you will need more than that.
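To make that concrete, here is the arithmetic spelled out, plus a sketch of narrowing the indexing policy to specific properties using the Python SDK (azure-cosmos 4.x). The account endpoint, key, database/container names, and the paths `/customerId` and `/orderDate` are placeholders for illustration, not anything from the original question:

```python
from azure.cosmos import CosmosClient, PartitionKey

# Back-of-the-envelope RU math from the numbers above.
docs = 1_000_000
minutes = 10
ru_per_write = 5.5                          # ~1 KB doc, minimal indexing, per Microsoft capacity docs

docs_per_sec = docs / (minutes * 60)        # ~1,667 docs/s
required_rus = docs_per_sec * ru_per_write  # ~9,167 RU/s -> "around 9K RU/s"
print(f"{docs_per_sec:.0f} docs/s -> {required_rus:.0f} RU/s")

# Sketch: create a container that indexes only the properties you query on,
# excluding everything else (the default policy indexes every path).
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
db = client.get_database_client("mydb")

indexing_policy = {
    "indexingMode": "consistent",
    "includedPaths": [
        {"path": "/customerId/?"},
        {"path": "/orderDate/?"},
    ],
    "excludedPaths": [{"path": "/*"}],
}

db.create_container(
    id="orders",
    partition_key=PartitionKey(path="/customerId"),
    indexing_policy=indexing_policy,
    offer_throughput=10_000,  # dedicated throughput; see the partition-split note below
)
```

A narrower indexing policy cuts the per-write RU cost, which is why it matters at this ingestion rate.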
When adjusting the RU/s, I would suggest you keep checking the Metrics blade to see if you are hitting the provisioned throughput ceiling (throttled 429 responses) and adjust accordingly.
Assuming this is a one-time or batch process, bring the RU/s back down afterwards to avoid extra costs.
The speed of ingestion also depends on how many physical partitions you have. Each physical partition serves at most 10K RU/s, so if you set the RU/s above 10K, Cosmos DB will automatically start splitting partitions.
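As for the connector settings themselves, here is a minimal write sketch, assuming the azure-cosmosdb-spark connector for Spark 2.x (format `com.microsoft.azure.cosmosdb.spark`). The endpoint, key, database/collection names, and the `WritingBatchSize` value are placeholders to tune for your workload, and the config key names differ if you are on the newer Spark 3 (`cosmos.oltp`) connector, so verify them against your connector version:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([{"id": "1", "customerId": "c1"}])  # stand-in for your real DataFrame

# Hypothetical write configuration; key names per the azure-cosmosdb-spark docs.
write_config = {
    "Endpoint": "https://<account>.documents.azure.com:443/",
    "Masterkey": "<key>",
    "Database": "mydb",
    "Collection": "orders",
    "Upsert": "true",
    "WritingBatchSize": "1000",  # larger batches mean fewer round trips but more 429 risk
}

(df.write
   .format("com.microsoft.azure.cosmosdb.spark")
   .mode("append")
   .options(**write_config)
   .save())
```

With up to 20 workers, Spark-side parallelism is unlikely to be the bottleneck here; provisioned RU/s is, so size the throughput per the math above and back off the batch size if you see throttling.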