Upgrade Azure Storage SDK to a modern version #629

Open · wants to merge 20 commits into master

Conversation

RomanIakovlev (Contributor):

This change is a prerequisite for enabling token-based authentication for the Azure Storage operations (blobs and queues). It still leaves the option to use the existing connection-string-based authentication, though.

@elrayle (Collaborator) left a comment:

A couple of comments. I know you plan to do more testing, so I'll take another look once that is done.

Comment on lines 105 to 106
```javascript
const properties = await this.queueClient.getProperties()
return { count: properties.approximateMessagesCount }
```

Collaborator:

This likely applies in other places as well. In the original code, error handling logged the error. In the new code, error handling appears to rely on how `await` propagates exceptions, instead of using try/catch blocks to log the error the way the original did.

Suggested change:

```javascript
try {
  const properties = await this.queueClient.getProperties()
  return { count: properties.approximateMessagesCount }
} catch (error) {
  this.logger.error(error) // Log the error
  return null // Handle the error by returning null
}
```

@RomanIakovlev (Contributor, Author):

Yeah, it's a good point. There was one more place like this; I've updated it as well.

```javascript
if (connectionString) {
  this.client = QueueServiceClient.fromConnectionString(connectionString, pipelineOptions)
} else {
  this.client = new QueueServiceClient(
```

Collaborator:

This is a nice update: it maintains the old approach while setting up a more flexible long-term one.
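The dual-auth pattern being discussed can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: the `QueueServiceClient` class is injected as a parameter so the sketch stays self-contained, but in the real provider it would come from `@azure/storage-queue`, and `credential` would be something like `DefaultAzureCredential` from `@azure/identity`.

```javascript
// Hypothetical sketch of selecting between connection-string and token auth.
// `QueueServiceClient` is injected so this stands alone; in real code it is
// imported from @azure/storage-queue.
function createQueueClient({ connectionString, accountName, credential }, QueueServiceClient, pipelineOptions) {
  if (connectionString) {
    // Legacy path: shared-key auth embedded in the connection string.
    return QueueServiceClient.fromConnectionString(connectionString, pipelineOptions)
  }
  // Token path: an Entra ID credential (service principal, managed identity, ...).
  return new QueueServiceClient(`https://${accountName}.queue.core.windows.net`, credential, pipelineOptions)
}
```

Keeping both branches behind one factory is what lets existing deployments keep their connection strings while new ones move to token-based auth.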

The old code:

```javascript
const retryOperations = new AzureStorage.ExponentialRetryPolicyFilter()
this.client = AzureStorage.createQueueService(connectionString).withFilter(retryOperations)
```

The new code:

```javascript
constructor(connectionString, options) {
  const pipelineOptions = {
```

Collaborator:

I like the use of pipelineOptions. It makes it easier to see the configuration related to the queues.
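For illustration, one plausible shape for `pipelineOptions` that groups retry configuration the way the old `ExponentialRetryPolicyFilter` did. This is a sketch, not the PR's actual values; the option names follow the modern SDK's `StoragePipelineOptions`/`StorageRetryOptions`.

```javascript
// Illustrative only: exponential-retry configuration in the modern SDK style.
const pipelineOptions = {
  retryOptions: {
    maxTries: 4,              // total attempts, including the initial one
    retryDelayInMs: 1000,     // base delay; grows exponentially per retry
    maxRetryDelayInMs: 30000  // cap on the backoff delay
  }
}
```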

```javascript
}

// This API can only be used for the 'deadletter' store because we cannot look up documents by type performantly
async count(type, force = false) {
```
Collaborator:

Can you explain why this was deleted completely?

@RomanIakovlev (Contributor, Author):

It was an oversight, I've brought it back. Thanks for catching it!

@RomanIakovlev force-pushed the roman/azure_sdk branch 5 times, most recently from 3180953 to d4ba41c on January 20, 2025 at 07:42.
@qtomlinson (Collaborator) left a comment:

Thanks for upgrading the SDK! A couple of comments for you to consider.

Resolved review threads:
- ghcrawler/providers/queuing/storageQueue.js (outdated)
- ghcrawler/providers/queuing/storageQueue.js
- package.json (outdated)
- ghcrawler/providers/storage/storageDocStore.js (outdated)
@qtomlinson (Collaborator) left a comment:

Thanks for incorporating the changes.

@qtomlinson (Collaborator) commented on Feb 7, 2025:

Upon further testing, an error was observed when writing the deadletter to the blob store.

Expected:
The error information should be persisted in the deadletter path.

Observed:
The error information was not persisted.

Steps to reproduce:

1. Set up Azure storage using Azurite. See PR.
2. Trigger a harvest with an invalid type, e.g., a POST call to localhost:4000/harvest with the following payload:

   ```json
   [
       {
           "tool": "bogus",
           "coordinates": "pypi/pypi/-/platformdirs/4.2.0"
       }
   ]
   ```

   There should be an error because "bogus" is an invalid tool, and the error information should be persisted in the deadletter folder. However, this error information is not persisted. The following error was observed instead:

   ```
   Error: options.metadata with value "1" must be of type string.
   ```
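The error above is the blob SDK rejecting a non-string metadata value: `@azure/storage-blob` requires every metadata value to be a string before upload. A minimal, hypothetical illustration of the fix shape (the helper name is ours, not from the PR):

```javascript
// Hypothetical helper: coerce all metadata values to strings, since the Azure
// blob SDK rejects non-string metadata values (e.g. a numeric version).
function stringifyMetadata(metadata) {
  return Object.fromEntries(
    Object.entries(metadata).map(([key, value]) => [key, String(value)])
  )
}
```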

@qtomlinson (Collaborator):

The error writing deadletters can be addressed in a separate PR.

ljones140 and others added 2 commits on February 11, 2025 at 11:48:
- "So we have the ability to have a harvest connection with a connection string and queues with an Azure SPN"
- "Separate crawler queue connection from harvest" ("…om-harvest")
@ljones140 (Contributor):

After I added a commit to change the queue configuration, I deployed to dev and ran the integration tests:

https://github.com/clearlydefined/operations/actions/runs/13264887396/job/37029783039#step:7:355

The failure looks OK to me, as it was caused by a recent release of a package checked in the test.

@qtomlinson (Collaborator):

@ljones140 Thank you for your testing! Please note that our integration tests are designed for sanity checks and do not cover all of our use cases. During testing, I observed that the crawler queue was not cleared after all requests were completed (see below, 10 hidden messages in the queue). I suspect that this issue was related to the previous code change where the message receipt was not returned in the storageQueue.updateVisibilityTimeout method. This has been addressed in subsequent commits. Could you please verify that the crawler queue is now cleared after all requests are completed?
[screenshot: crawler queue with 10 hidden messages]
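The receipt-rotation issue described above can be sketched as follows. Azure Queue Storage issues a new `popReceipt` on every `updateMessage` call, so a method like `storageQueue.updateVisibilityTimeout` has to return (or store) the new receipt; deleting the message later with a stale receipt fails, leaving hidden messages in the queue. The function below is an illustrative sketch, not the PR's actual code.

```javascript
// Hedged sketch: propagate the rotated popReceipt after updating visibility,
// so a later deleteMessage call can succeed. `queueClient` follows the
// @azure/storage-queue QueueClient.updateMessage signature.
async function updateVisibilityTimeout(queueClient, message, visibilityTimeoutSeconds) {
  const response = await queueClient.updateMessage(
    message.messageId,
    message.popReceipt,
    message.messageText,
    visibilityTimeoutSeconds
  )
  // Return the message with the rotated receipt; the old receipt is now stale.
  return { ...message, popReceipt: response.popReceipt }
}
```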

@ljones140 (Contributor):

@qtomlinson I ran the integration tests again and the queues cleared.

[Screenshot 2025-02-12 15:00:40: queues cleared]

@qtomlinson (Collaborator) left a comment:

LGTM. Please address #629 (comment) in a subsequent PR.

@ljones140 (Contributor):

> LGTM. Please address #629 (comment) in a subsequent PR.

@qtomlinson I'm adding the fix to this PR, as it was very simple: edc8bb2.

But it did take a long time to debug. I used the Azure Docker environment, which was very helpful.

@ljones140 (Contributor):

Proof of the deadletter fix working:

[Screenshot 2025-02-13 18:08:43]

@qtomlinson (Collaborator) left a comment:

Thanks for the quick fix! I am glad that the Azurite setup makes debugging easier.

```diff
@@ -638,7 +638,7 @@ class Crawler {
   metadata.errorMessage = request._error.message
   metadata.errorStack = request._error.stack
 }
-metadata.version = "1"
+metadata.version = '1'
```
Collaborator:

Should this be updated to '1.1' as we are changing the file format for the deadletter here?

  • Deadletters are internal exception information. The change here looks acceptable to me.
  • Comparing the metadata structure of a deadletter and a harvested result (both are persisted documents in storage), there is no metadata.version in recent harvest results. Older harvest results prior to 2019 have metadata.version as a string. In the future, we may want to align with the metadata in harvested results and consider using schemaVersion instead.
