Add ERDDAP URL support for dataset ingestion in ioos_qc streams #184
base: main
Conversation
Replace mocked HTTP test with integration test against actual ERDDAP server. Tests latitude and longitude QC on real dataset from NOAA NCEI ERDDAP. Verifies end-to-end functionality with live data source.
@tanmayrajurkar are you using AI to generate this code? Can you comment on the real use cases for these changes? Like a real-life example/notebook?
@ocefpaf Thanks for asking.
On AI usage: I do use AI-assisted tools in my workflow (similar to […]).
On real-world use cases: this change is motivated by practical marine […]. Being able to point ioos_qc directly at an ERDDAP URL allows QC to run […].
Happy to add a small example or notebook if that would be useful.
```python
def _is_http_url(value: object) -> bool:
    """Return True if value is an http(s) URL string."""
    if not isinstance(value, str):
        return False
    parsed = urllib.parse.urlparse(value)
    return parsed.scheme in {"http", "https"} and bool(parsed.netloc)
```
@tanmayrajurkar in #184 (comment) you said:
I make sure I fully understand and validate everything before submitting.
In light of that, can I ask you to review this function as if you had not authored it? Do you see anything that can be improved, fixed, or streamlined?
@ocefpaf
Yes, reviewing it as if I didn't write it:
The intent of _is_http_url is to be a very small explicit guard to distinguish local paths from remote resources.
It returns True only for explicit http/https schemes with a non-empty network location (netloc), so relative paths and URLs like file:// are excluded.
I've kept the code deliberately minimal and side-effect free, since it's only used as a routing helper and not as a full URL validator. If you think supporting additional schemes or edge cases would be preferable, I'm happy to accommodate.
That is an AI description, right? A good review would question why the annotation accepts object if only str is allowed and enforced in the function.
@ocefpaf Actually, yes: the explanation was over-formal and not very clear. I will put a simpler version of it in my own words below.
That's correct: there's no strong reason to accept object here.
This should just take str, since anything else is rejected immediately.
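A minimal sketch of the str-only version being discussed, assuming the isinstance guard (if still needed) moves to the caller; this is illustrative, not necessarily the final code in the PR:

```python
import urllib.parse


def _is_http_url(value: str) -> bool:
    """Return True if value is an http(s) URL with a non-empty network location."""
    parsed = urllib.parse.urlparse(value)
    return parsed.scheme in {"http", "https"} and bool(parsed.netloc)
```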
```python
L = logging.getLogger(__name__)


_ERDDAP_FORMAT_RE = re.compile(
    r"\.(?P<fmt>csv|csvp|csv0|nc|nccf|nc4|cdf)(?P<tail>$|\?)",
)
```
@tanmayrajurkar can you explain this REGEX in words?
Sure.
This regular expression identifies dataset URLs on ERDDAP-style servers by looking for a known data format extension at the end of the path, optionally followed by a query string.
Explaining it further:
- `\.` matches the literal dot before the format.
- `(?P<fmt>csv|csvp|csv0|nc|nccf|nc4|cdf)` captures the dataset format, which can be csv, csvp, csv0, nc, nccf, nc4, or cdf.
- `(?P<tail>$|\?)` ensures the format appears either at the end of the URL or immediately before a `?`, which is typical for ERDDAP tabledap queries.

So it matches URLs like:
- `/dataset.csv`
- `/dataset.nc`
- `/dataset.csv?time,temperature`
The goal is not general URL validation, but lightweight detection of ERDDAP-style CSV/NetCDF endpoints so they can be routed to the correct loader.
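A quick illustration of that matching behavior (the URLs below are made up for the example, not taken from the PR):

```python
import re

_ERDDAP_FORMAT_RE = re.compile(
    r"\.(?P<fmt>csv|csvp|csv0|nc|nccf|nc4|cdf)(?P<tail>$|\?)",
)

for url in (
    "https://example.org/erddap/tabledap/dataset.csv",
    "https://example.org/erddap/tabledap/dataset.csv?time,temperature",
    "https://example.org/erddap/tabledap/dataset.html",  # not a recognized format
):
    match = _ERDDAP_FORMAT_RE.search(url)
    print(url, "->", match.group("fmt") if match else None)
# fmt groups printed: csv, csv, None
```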
Isn't a regex overkill for this task? Also, check what erddapy does. Try to read its code and not just AI-parse it. We can learn a lot by reading code from a real human.
@ocefpaf Agreed, the regex is not needed for this purpose. After reviewing the erddapy library, relying on it directly is the simpler and better solution. I will eliminate the regex-based mechanism altogether.
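For reference, relying on erddapy directly could look roughly like the sketch below; the server URL, dataset id, and variables are placeholders, not values from this PR:

```python
from erddapy import ERDDAP

# Placeholder server and dataset; any public ERDDAP tabledap dataset works the same way.
e = ERDDAP(server="https://example-erddap-server/erddap", protocol="tabledap")
e.dataset_id = "example_dataset_id"
e.variables = ["time", "latitude", "longitude", "sea_water_temperature"]

df = e.to_pandas()    # tabular data as a pandas DataFrame
# ds = e.to_xarray()  # or an xarray Dataset for NetCDF-style access
```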
Sure. While we do expect that data in an ERDDAP server was already QCed, that can happen and/or be part of someone's workflow. However, fetching or streaming data should be a function of an online app and not of the core QC library. What do you think?
```python
def _fetch_url_bytes(url: str, *, timeout: float = 30.0) -> bytes:
    """Fetch URL content as bytes with basic, user-friendly errors."""
    try:
        req = urllib.request.Request(  # noqa: S310
            url,
            headers={"User-Agent": "ioos_qc (python urllib)"},
        )
        with urllib.request.urlopen(req, timeout=timeout) as resp:  # noqa: S310
            return resp.read()
    except urllib.error.HTTPError as e:
        msg = f"HTTP error fetching {url!r}: {e.code} {getattr(e, 'reason', '')}".strip()
        raise ValueError(msg) from e
    except urllib.error.URLError as e:
        msg = f"URL error fetching {url!r}: {getattr(e, 'reason', e)}"
        raise ValueError(msg) from e
```
The cyclomatic complexity here is high for a simple fetch function. Also, requests is already in the dependency tree; one could make it a hard dependency and use that instead of bringing in raw urllib calls. Last but not least, erddapy is already a hard dependency and can fetch ERDDAP data, reducing a lot of the code complexity here.
@ocefpaf
That’s a fair point — agreed.
I kept this as urllib initially to avoid introducing new hard dependencies, but I agree the function is more complex than it should be for what it does. Given that requests is already in the dependency tree, switching to it would simplify both the code and error handling.
I also agree that leaning on erddapy is likely the cleanest approach for ERDDAP-specific access and would reduce this logic substantially. I’m happy to refactor in that direction or move this out of core entirely if that’s preferable.
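For reference, a possible requests-based version might look like the sketch below; this is illustrative only, not the final refactor:

```python
import requests


def _fetch_url_bytes(url: str, *, timeout: float = 30.0) -> bytes:
    """Sketch: fetch URL content as bytes via requests, surfacing a friendly error."""
    try:
        resp = requests.get(url, headers={"User-Agent": "ioos_qc"}, timeout=timeout)
        resp.raise_for_status()  # turn 4xx/5xx responses into exceptions
    except requests.RequestException as e:
        msg = f"Error fetching {url!r}: {e}"
        raise ValueError(msg) from e
    return resp.content
```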
> That’s a fair point — agreed.

It is very tiresome for a human to keep reading AI-generated comments. @tanmayrajurkar, can you use your own "voice" instead? It would be easier to understand your needs that way.
@ocefpaf Noted. Thanks for pointing that out. I will keep my answers more brief and written in my own way.
That’s a fair point — I agree that full fetching/streaming logic belongs in an application layer, not the core QC library. My intent here was only a very thin convenience to resolve a dataset reference (local path vs remote endpoint) before handing off to the existing stream classes, not to turn ioos_qc into an online client. If even that feels out of scope for core, I’m happy to move this logic out to an example/helper and keep ioos_qc strictly file/object-based.
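To make that intent concrete, here is a deliberately minimal sketch of the kind of thin routing being described; the function name is made up for illustration and only covers the CSV case (NetCDF endpoints would go through xarray instead):

```python
import pandas as pd

from ioos_qc.streams import PandasStream


def _stream_from_reference(ref: str) -> PandasStream:
    """Illustrative only: resolve a local path or remote ERDDAP CSV into a PandasStream."""
    # pandas accepts both local paths and http(s) URLs, so the helper stays thin.
    # Note: ERDDAP's plain .csv responses include a units row; the .csvp response
    # puts units in the header instead, which is friendlier for pandas.
    df = pd.read_csv(ref)
    return PandasStream(df)
```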
Refs #183
Summary
This PR adds support for running IOOS QC directly from public ERDDAP
dataset URLs, without changing any existing QC logic or behavior.
The change introduces a small helper that accepts either a local file
path or an ERDDAP URL (CSV or NetCDF) and routes the loaded data through
the existing stream classes.
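As a rough usage sketch (the helper name comes from this PR; the exact signature and the URL below are illustrative assumptions):

```python
from ioos_qc.streams import stream_from_path_or_erddap_url  # helper proposed in this PR

# Placeholder ERDDAP tabledap CSV endpoint; a local file path works the same way.
url = "https://example-erddap-server/erddap/tabledap/dataset.csvp?time,latitude,longitude"

stream = stream_from_path_or_erddap_url(url)
# The returned object is a regular PandasStream/XarrayStream, so the existing
# QC workflow applies, e.g. results = stream.run(existing_config).
```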
What Changed
- Adds a `stream_from_path_or_erddap_url()` helper to `ioos_qc.streams`
- Routes the loaded data through the existing `PandasStream` and `XarrayStream` classes

What Did NOT Change
- No existing QC logic or behavior is modified
Motivation
Many oceanographic datasets are accessed via ERDDAP. Requiring users to
manually download datasets before running QC adds unnecessary friction.
This change enables a more direct and exploratory QC workflow while
keeping the library API minimal and backward compatible.
Testing
- Replaced the mocked HTTP test with an integration test against a live NOAA NCEI ERDDAP dataset, exercising latitude and longitude QC end to end.
Feedback welcome, especially on API placement and naming.