Upgrade Azure Storage SDK to a modern version #629

Open · wants to merge 20 commits into master

Conversation

RomanIakovlev (Contributor):

This change is a prerequisite for enabling token-based authentication for the Azure Storage operations (blobs and queues). It still leaves the option to use the existing connection-string-based authentication, though.

@elrayle (Collaborator) left a comment:

A couple of comments. I know you plan to do more testing, so I'll take another look once that is done.

Comment on lines 105 to 106
```javascript
const properties = await this.queueClient.getProperties()
return { count: properties.approximateMessagesCount }
```

Collaborator:

This likely applies in other places as well. In the original code, error handling logged the error. In the new code, error handling appears to rely on how `await` propagates exceptions, instead of using try/catch blocks to log the error the way the original did.

Suggested change:

```javascript
try {
  const properties = await this.queueClient.getProperties()
  return { count: properties.approximateMessagesCount }
} catch (error) {
  this.logger.error(error) // Log the error
  return null // Handle the error by returning null
}
```

@RomanIakovlev (Contributor, Author):

Yeah, it's a good point. There was one more place like this; I've updated it as well.

```javascript
if (connectionString) {
  this.client = QueueServiceClient.fromConnectionString(connectionString, pipelineOptions)
} else {
  this.client = new QueueServiceClient(
```

Collaborator:

This is a nice update: it maintains the old approach while setting up a more flexible long-term one.
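The dual-auth pattern being discussed can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: the `QueueServiceClient` class is injected as a parameter so the sketch stays self-contained, but in the real provider it would come from `@azure/storage-queue`, and `credential` would be something like `DefaultAzureCredential` from `@azure/identity`.

```javascript
// Hypothetical sketch of selecting between connection-string and token auth.
// `QueueServiceClient` is injected so this stands alone; in real code it is
// imported from @azure/storage-queue.
function createQueueClient({ connectionString, accountName, credential }, QueueServiceClient, pipelineOptions) {
  if (connectionString) {
    // Legacy path: shared-key auth embedded in the connection string.
    return QueueServiceClient.fromConnectionString(connectionString, pipelineOptions)
  }
  // Token path: an Entra ID credential (service principal, managed identity, ...).
  return new QueueServiceClient(`https://${accountName}.queue.core.windows.net`, credential, pipelineOptions)
}
```

Keeping both branches behind one factory is what lets existing deployments keep their connection strings while new ones move to token-based auth.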

The old code:

```javascript
const retryOperations = new AzureStorage.ExponentialRetryPolicyFilter()
this.client = AzureStorage.createQueueService(connectionString).withFilter(retryOperations)
```

The new code:

```javascript
constructor(connectionString, options) {
  const pipelineOptions = {
```

Collaborator:

I like the use of pipelineOptions. It makes it easier to see the configuration related to the queues.
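For illustration, one plausible shape for `pipelineOptions` that groups retry configuration the way the old `ExponentialRetryPolicyFilter` did. This is a sketch, not the PR's actual values; the option names follow the modern SDK's `StoragePipelineOptions`/`StorageRetryOptions`.

```javascript
// Illustrative only: exponential-retry configuration in the modern SDK style.
const pipelineOptions = {
  retryOptions: {
    maxTries: 4,              // total attempts, including the initial one
    retryDelayInMs: 1000,     // base delay; grows exponentially per retry
    maxRetryDelayInMs: 30000  // cap on the backoff delay
  }
}
```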

```javascript
}

// This API can only be used for the 'deadletter' store because we cannot look up documents by type performantly
async count(type, force = false) {
```
Collaborator:

Can you explain why this was deleted completely?

@RomanIakovlev (Contributor, Author):

It was an oversight, I've brought it back. Thanks for catching it!

@RomanIakovlev force-pushed the roman/azure_sdk branch 5 times, most recently from 3180953 to d4ba41c on January 20, 2025 at 07:42.
@qtomlinson (Collaborator) left a comment:

Thanks for upgrading the SDK! A couple of comments for you to consider.

Resolved review threads:
- ghcrawler/providers/queuing/storageQueue.js (outdated)
- ghcrawler/providers/queuing/storageQueue.js
- package.json (outdated)
- ghcrawler/providers/storage/storageDocStore.js (outdated)
@qtomlinson (Collaborator) left a comment:

Thanks for incorporating the changes.

@qtomlinson (Collaborator) commented on Feb 7, 2025:

Upon further testing, an error was observed when writing the deadletter to the blob store.

Expected:
The error information should be persisted in the deadletter path.

Observed:
The error information was not persisted.

Steps to reproduce:

1. Set up Azure storage using Azurite. See PR.
2. Trigger a harvest with an invalid type, e.g., a POST call to localhost:4000/harvest with the following payload:

   ```json
   [
       {
           "tool": "bogus",
           "coordinates": "pypi/pypi/-/platformdirs/4.2.0"
       }
   ]
   ```

   There should be an error because "bogus" is an invalid tool, and the error information should be persisted in the deadletter folder. However, this error information is not persisted. The following error was observed instead:

   ```
   Error: options.metadata with value "1" must be of type string.
   ```
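The error above is the blob SDK rejecting a non-string metadata value: `@azure/storage-blob` requires every metadata value to be a string before upload. A minimal, hypothetical illustration of the fix shape (the helper name is ours, not from the PR):

```javascript
// Hypothetical helper: coerce all metadata values to strings, since the Azure
// blob SDK rejects non-string metadata values (e.g. a numeric version).
function stringifyMetadata(metadata) {
  return Object.fromEntries(
    Object.entries(metadata).map(([key, value]) => [key, String(value)])
  )
}
```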

@qtomlinson (Collaborator):

The error writing deadletters can be addressed in a separate PR.

ljones140 and others added 2 commits on February 11, 2025 at 11:48:
- "So we have the ability to have a harvest connection with a connection string and queues with an Azure SPN"
- "Separate crawler queue connection from harvest" ("…om-harvest")
@ljones140 (Contributor):

After I added a commit to change the queue configuration, I deployed to dev and ran the integration tests:

https://github.com/clearlydefined/operations/actions/runs/13264887396/job/37029783039#step:7:355

The failure looks OK to me, as it was caused by a recent release of a package checked in the test.

@qtomlinson (Collaborator):

@ljones140 Thank you for your testing! Please note that our integration tests are designed for sanity checks and do not cover all of our use cases. During testing, I observed that the crawler queue was not cleared after all requests were completed (see below, 10 hidden messages in the queue). I suspect that this issue was related to the previous code change where the message receipt was not returned in the storageQueue.updateVisibilityTimeout method. This has been addressed in subsequent commits. Could you please verify that the crawler queue is now cleared after all requests are completed?
[screenshot: crawler queue with 10 hidden messages]
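The receipt-rotation issue described above can be sketched as follows. Azure Queue Storage issues a new `popReceipt` on every `updateMessage` call, so a method like `storageQueue.updateVisibilityTimeout` has to return (or store) the new receipt; deleting the message later with a stale receipt fails, leaving hidden messages in the queue. The function below is an illustrative sketch, not the PR's actual code.

```javascript
// Hedged sketch: propagate the rotated popReceipt after updating visibility,
// so a later deleteMessage call can succeed. `queueClient` follows the
// @azure/storage-queue QueueClient.updateMessage signature.
async function updateVisibilityTimeout(queueClient, message, visibilityTimeoutSeconds) {
  const response = await queueClient.updateMessage(
    message.messageId,
    message.popReceipt,
    message.messageText,
    visibilityTimeoutSeconds
  )
  // Return the message with the rotated receipt; the old receipt is now stale.
  return { ...message, popReceipt: response.popReceipt }
}
```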

@ljones140 (Contributor):

@qtomlinson I ran the integration tests again and the queues cleared.

[Screenshot 2025-02-12 15:00:40: queues cleared]

@qtomlinson (Collaborator) left a comment:

LGTM. Please address #629 (comment) in a subsequent PR.

@ljones140 (Contributor):

> LGTM. Please address #629 (comment) in a subsequent PR.

@qtomlinson I'm adding the fix to this PR, as it was very simple: edc8bb2.

But it did take a long time to debug. I used the Azure Docker environment, which was very helpful.

@ljones140 (Contributor):

Proof of the deadletter fix working:

[Screenshot 2025-02-13 18:08:43]

@qtomlinson (Collaborator) left a comment:

Thanks for the quick fix! I am glad that the Azurite setup makes debugging easier.

```diff
@@ -638,7 +638,7 @@ class Crawler {
   metadata.errorMessage = request._error.message
   metadata.errorStack = request._error.stack
 }
-metadata.version = "1"
+metadata.version = '1'
```
Collaborator:

Should this be updated to '1.1' as we are changing the file format for the deadletter here?

  • Deadletters are internal exception information. The change here looks acceptable to me.
  • Comparing the metadata structure of a deadletter and a harvested result (both are persisted documents in storage), there is no metadata.version in recent harvest results. Older harvest results prior to 2019 have metadata.version as a string. In the future, we may want to align with the metadata in harvested results and consider using schemaVersion instead.
