Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor URL extraction logic in url_validator.py #278

Merged
merged 2 commits into from
Apr 24, 2024

Conversation

rajeshpandey2053
Copy link
Contributor

@rajeshpandey2053 rajeshpandey2053 commented Apr 5, 2024

Description:

This pull request addresses the discrepancy in URL recommendations between pyQuARC and QuARC.

Issue:

We are running pyQuARC in AWS Lambda functions to build the QuARC API. Lambda only supports a read-only file system. If someone attempts to write something to the Lambda, it throws an error. In our case, pyQuARC uses the urlextract package, which attempts to save some files in local storage for caching purposes, resulting in the error. The following is an extract from the class implemented in the pyQuARC.
Initialize function for URLExtract class. Tries to get cached TLDs, if the cached file does not exist it will try to download the new list from IANA and save it to cache file.

Process:

To resolve this, the dependency on the urlextract package has been replaced with regex expressions for URL extraction, eliminating the file system dependency.

Testing :

Testing was done with the following list of concept ids with their respective formats ensuring we extract same list of URLs from a text using the regex expressions and urlextract package.

For further details :
#273

@jenny-m-wood
Copy link
Collaborator

Thanks for your changes. I noticed the following during testing:

  • When testing C2433571719-CDDIS (umm-c), there is a data format error present in the QuARC dev environment output that is not present in pyQuARC's fix_check_url branch output. "SINEX" is a valid GCMD keyword, so no error should be present. Perhaps the GCMD keywords for QuARC are out of date? See screenshot of QuARC dev output:
    Screenshot 2024-04-08 at 1 37 24 PM

  • When testing C2103888967-LARC (dif10), there is an extra broken URL specified in the QuARC dev environment output that is not present in pyQuARC's fix_check_url branch output. It is https://www.atmosp.physics.utoronto.ca/MOPITT/home.html. See screenshot of QuARC dev output:
    Screenshot 2024-04-08 at 2 22 26 PM
    See screenshot of pyQuARC fix_check_url output:
    Screenshot 2024-04-08 at 2 23 17 PM

@xhagrg
Copy link
Contributor

xhagrg commented Apr 9, 2024

Did you check lipoja/URLExtract#61 @rajeshpandey2053? We should be using packages when possible.

@rajeshpandey2053
Copy link
Contributor Author

rajeshpandey2053 commented Apr 10, 2024

Thanks for your changes. I noticed the following during testing:

  • When testing C2433571719-CDDIS (umm-c), there is a data format error present in the QuARC dev environment output that is not present in pyQuARC's fix_check_url branch output. "SINEX" is a valid GCMD keyword, so no error should be present. Perhaps the GCMD keywords for QuARC are out of date? See screenshot of QuARC dev output:
    Screenshot 2024-04-08 at 1 37 24 PM
  • When testing C2103888967-LARC (dif10), there is an extra broken URL specified in the QuARC dev environment output that is not present in pyQuARC's fix_check_url branch output. It is https://www.atmosp.physics.utoronto.ca/MOPITT/home.html. See screenshot of QuARC dev output:
    Screenshot 2024-04-08 at 2 22 26 PM
    See screenshot of pyQuARC fix_check_url output:
    Screenshot 2024-04-08 at 2 23 17 PM
  • The first one is due to the dev environment of quARC having a master branch of pyQuARC. I have updated it to the dev branch.
  • Second discrepancy is due to we have used python 3.8 as a lambda runtime version in quARC. And it uses the old version of OpenSSL. Created new ticket to resolve this issue

@xhagrg
Copy link
Contributor

xhagrg commented Apr 24, 2024

@rajeshpandey2053 has this been tested? If yes, LGTM.

@rajeshpandey2053
Copy link
Contributor Author

@rajeshpandey2053 has this been tested? If yes, LGTM.

Yes, it has been tested as well. Thank you will merge it then

@rajeshpandey2053 rajeshpandey2053 merged commit 246eb9a into dev Apr 24, 2024
1 check passed
@slesaad slesaad mentioned this pull request Jun 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants