Refactor URL extraction logic in url_validator.py #278

rajeshpandey2053 · 2024-04-05T15:14:04Z

Description:

This pull request addresses the discrepancy in URL recommendations between pyQuARC and QuARC.

Issue:

We run pyQuARC in AWS Lambda functions to build the QuARC API. Lambda's file system is read-only (apart from /tmp), so any attempt to write to it raises an error. pyQuARC uses the urlextract package, which tries to save files to local storage for caching, and that write triggers the error. The following is an extract from the docstring of the URLExtract class that pyQuARC uses:
Initialize function for URLExtract class. Tries to get cached TLDs, if the cached file does not exist it will try to download the new list from IANA and save it to cache file.
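
As an illustration of the constraint, here is a minimal sketch of how code could locate a directory that actually accepts writes before handing it to any on-disk cache. On AWS Lambda, /tmp is the writable scratch location. The helper name `pick_writable_dir` and the candidate paths are hypothetical; they are not part of pyQuARC or urlextract.

```python
import os
import tempfile
from typing import Iterable, Optional


def pick_writable_dir(candidates: Iterable[str]) -> Optional[str]:
    """Return the first existing directory in `candidates` that accepts writes."""
    for path in candidates:
        if not os.path.isdir(path):
            continue
        try:
            # Probe with a real write: on a read-only mount this raises
            # OSError (for example "[Errno 30] Read-only file system").
            with tempfile.NamedTemporaryFile(dir=path):
                pass
        except OSError:
            continue
        return path
    return None


# Hypothetical usage: prefer the package directory, fall back to the temp dir.
cache_dir = pick_writable_dir([os.getcwd(), tempfile.gettempdir()])
```

The probe file is deleted as soon as the `with` block exits, so the check leaves nothing behind.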

Process:

To resolve this, the dependency on the urlextract package has been replaced with regular expressions for URL extraction, which removes the file system dependency.
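
A minimal sketch of what regex-based extraction can look like, for illustration only. The pattern and the function name `extract_urls` are assumptions; the actual expressions used in url_validator.py are not shown in this thread. The sketch matches only http(s) URLs, strips trailing punctuation, and needs no file system access.

```python
import re
from typing import List

# Hypothetical pattern: an http(s) scheme followed by any run of characters
# that are not whitespace, quotes, or angle brackets.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)

# Punctuation that commonly trails a URL in prose but is not part of it.
TRAILING_PUNCTUATION = ".,;:!?)]}"


def extract_urls(text: str) -> List[str]:
    """Return unique http(s) URLs found in `text`, in order of appearance."""
    seen = set()
    urls = []
    for match in URL_PATTERN.finditer(text):
        url = match.group(0).rstrip(TRAILING_PUNCTUATION)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Unlike urlextract, this sketch does not recognize bare domains without a scheme, and stripping a trailing ")" can truncate URLs that legitimately end in a parenthesis, so results may differ on such inputs.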

Testing:

Testing used the following list of concept IDs, with their respective formats, to confirm that the regular expressions and the urlextract package extract the same list of URLs from a given text.
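
A comparison of this kind can be sketched as a set difference between the two extractors' outputs. The helper name `diff_url_lists` is hypothetical and is not taken from the pyQuARC test suite.

```python
from typing import Dict, Iterable, List


def diff_url_lists(expected: Iterable[str], actual: Iterable[str]) -> Dict[str, List[str]]:
    """Report URLs found by only one of two extractors, ignoring order and duplicates."""
    expected_set, actual_set = set(expected), set(actual)
    return {
        "missing": sorted(expected_set - actual_set),
        "unexpected": sorted(actual_set - expected_set),
    }
```

Both lists being empty means the two extractors agree on that record.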

For further details:
#273

jenny-m-wood · 2024-04-08T19:23:41Z

Thanks for your changes. I noticed the following during testing:

When testing C2433571719-CDDIS (umm-c), there is a data format error present in the QuARC dev environment output that is not present in pyQuARC's fix_check_url branch output. "SINEX" is a valid GCMD keyword, so no error should be present. Perhaps the GCMD keywords for QuARC are out of date? See screenshot of QuARC dev output:
When testing C2103888967-LARC (dif10), there is an extra broken URL specified in the QuARC dev environment output that is not present in pyQuARC's fix_check_url branch output. It is https://www.atmosp.physics.utoronto.ca/MOPITT/home.html. See screenshot of QuARC dev output:

See screenshot of pyQuARC fix_check_url output:

xhagrg · 2024-04-09T20:50:46Z

Did you check lipoja/URLExtract#61 @rajeshpandey2053? We should be using packages when possible.

rajeshpandey2053 · 2024-04-10T22:17:48Z

The first discrepancy occurred because the QuARC dev environment was running the master branch of pyQuARC. I have updated it to the dev branch.
The second discrepancy occurs because QuARC uses Python 3.8 as its Lambda runtime version, which uses an old version of OpenSSL. I have created a new ticket to resolve this issue.
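
For anyone reproducing the second discrepancy, a small sketch like the following could be run in each environment to compare the Python and OpenSSL versions in use. The function name `runtime_tls_info` is hypothetical; `ssl.OPENSSL_VERSION` and `ssl.OPENSSL_VERSION_INFO` are standard library attributes.

```python
import ssl
import sys


def runtime_tls_info() -> dict:
    """Collect the Python and OpenSSL versions of the running interpreter."""
    return {
        "python": ".".join(str(part) for part in sys.version_info[:3]),
        "openssl": ssl.OPENSSL_VERSION,
        "openssl_version_info": ssl.OPENSSL_VERSION_INFO,
    }


if __name__ == "__main__":
    print(runtime_tls_info())
```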

xhagrg · 2024-04-24T13:44:12Z

@rajeshpandey2053 has this been tested? If yes, LGTM.

rajeshpandey2053 · 2024-04-24T13:55:00Z

Yes, it has been tested as well. Thank you, I will merge it then.

Refactor URL extraction logic in url_validator.py

6861319

rajeshpandey2053 requested review from slesaad and jenny-m-wood April 5, 2024 15:14

update urlextract version and provide cache directory info in constructor

7bfce69

rajeshpandey2053 merged commit 246eb9a into dev Apr 24, 2024
1 check passed

slesaad mentioned this pull request May 24, 2024

Bug: [Errno 30] Read-only file system: '/home/sbx_user1051' in certain checks NASA-IMPACT/QuARC#31

Closed

slesaad mentioned this pull request Jun 24, 2024

Release 1.2.7 #290

Merged
