Performance issues observed after updating from azure-cosmosdb-spark_2.4.0_2.11-2.0.1 to azure-cosmosdb-spark_2.4.0_2.11-3.7.0 #469

manums · 2022-03-03T07:35:06Z

Language - Scala
Compute - HDInsight
Spark Version - 2.4
Bulk Import - True
Connection mode - DirectHttps
WriteBatchSize - 3000

Issue

In our product, we use the Cosmos DB Spark connector to save DataFrames to Cosmos DB. We had been on version 2.0.1 for a year or more and ran into a couple of issues with data missing after the save operation. We were advised to move to the latest version (3.7.0), which was said to fix the bulk ingestion issue and to handle transient errors from Cosmos DB.

We upgraded the connector package in our code from azure-cosmosdb-spark_2.4.0_2.11-2.0.1 to azure-cosmosdb-spark_2.4.0_2.11-3.7.0. These changes went live on Feb 1st, 2022.

We made no code changes (other than the version number) when migrating from 2.4.0_2.11-2.0.1 to 2.4.0_2.11-3.7.0.

Our Spark jobs run on an HDInsight cluster, which prevents us from moving to Spark 3.0.

Since the upgrade, we are seeing increased latency in the CosmosDBSpark.Save call for similarly sized DataFrames.
The RUs provisioned and passed in as WriteThroughputBudget are the same as with the previous version, and the maxExecutors set for the Spark job did not change either.

However, performance seems to have taken a big hit.

Question - Please advise whether we missed any migration step when upgrading versions that might have caused this performance degradation.

We don’t see any intermittent errors or issues in the Spark logs during this save operation.

Code snippet to initialize the Config passed to CosmosDBSpark.Save(df, config) method:
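The snippet itself is not included in this post. As a rough sketch only, a write configuration matching the settings listed at the top of this issue (bulk import, DirectHttps, batch size 3000, a 160K RU write budget) could look like the following. The endpoint, key, database, and collection values are placeholders, the `CosmosWriteSketch` object and `saveToCosmos` helper are hypothetical names, and the option names are assumed from the connector's documented write options, so this should not be read as the exact code in use.

```scala
import com.microsoft.azure.cosmosdb.spark.CosmosDBSpark
import com.microsoft.azure.cosmosdb.spark.config.Config
import org.apache.spark.sql.DataFrame

object CosmosWriteSketch {
  // All connection values below are placeholders, not taken from the original post.
  val writeConfig: Config = Config(Map(
    "Endpoint"              -> "https://<account>.documents.azure.com:443/",
    "Masterkey"             -> "<key>",
    "Database"              -> "<database>",
    "Collection"            -> "<collection>",
    "ConnectionMode"        -> "DirectHttps", // also tried with "Gateway"
    "BulkImport"            -> "true",
    "WritingBatchSize"      -> "3000",
    "WriteThroughputBudget" -> "160000"       // 160K RUs, as in the repro
  ))

  // Hypothetical helper: writes the DataFrame using the config above.
  def saveToCosmos(df: DataFrame): Unit =
    CosmosDBSpark.save(df, writeConfig)
}
```

Keeping this map identical across the 2.0.1 and 3.7.0 runs is what makes the before/after comparison meaningful, since only the connector version changes.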

Performance numbers before and after upgrade:


manums · 2022-03-03T07:46:05Z

We have also tried setting the connection mode to Gateway, and we see the same performance degradation.

manums · 2022-03-04T09:46:01Z

I tried calling Save() with the exact same DataFrame, under the same Spark config and with exactly the same WriteThroughputBudget (160K RUs).

I observed a huge performance difference between 2.0.1 and 3.7.0 (a few multiples).

This degradation is hurting our product's performance. I request your assistance here.
