Add ERDDAP URL support for dataset ingestion in ioos_qc streams #184
base: main
Conversation
Replace mocked HTTP test with integration test against actual ERDDAP server. Tests latitude and longitude QC on real dataset from NOAA NCEI ERDDAP. Verifies end-to-end functionality with live data source.
@tanmayrajurkar are you using AI to generate this code? Can you comment on the real use cases for these changes? Like a real-life example/notebook?
@ocefpaf Thanks for asking.
On AI usage: I do use AI-assisted tools in my workflow (similar to […]).
On real-world use cases: this change is motivated by practical marine […]. Being able to point ioos_qc directly at an ERDDAP URL allows QC to run […].
Happy to add a small example or notebook if that would be useful.
```python
def _is_http_url(value: object) -> bool:
    """Return True if value is an http(s) URL string."""
    if not isinstance(value, str):
        return False
    parsed = urllib.parse.urlparse(value)
    return parsed.scheme in {"http", "https"} and bool(parsed.netloc)
```
@tanmayrajurkar in #184 (comment) you said:
I make sure I fully understand and validate everything before submitting.
In light of that, can I ask you to review this function as if you had not authored it? Do you see anything that can be improved, fixed, or streamlined?
@ocefpaf
Yes, reviewing it as if I didn't write it:
The intent of _is_http_url is to be a very small explicit guard to distinguish local paths from remote resources.
It returns True only for explicit http/https schemes with a non-empty network location (netloc), so relative paths and URLs like file:// are excluded.
I've kept the code deliberately minimal and side-effect free, since it's only used as a routing helper and not as a full URL validator. If you think supporting additional schemes or edge cases would be preferable, I'm happy to accommodate.
That is an AI description, right? A good review would question why the annotation accepts object if only str is allowed and enforced in the function.
@ocefpaf Actually, yes: the explanation was over-formal and not very clear. I will put a simpler version of it in my own words below.
That's correct: there's no strong reason to accept object here.
This should just take str, since anything else is rejected immediately.
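A minimal sketch of the str-only version being discussed, assuming the isinstance guard (if still needed) moves to the caller; this is illustrative, not necessarily the final code in the PR:

```python
import urllib.parse


def _is_http_url(value: str) -> bool:
    """Return True if value is an http(s) URL with a non-empty network location."""
    parsed = urllib.parse.urlparse(value)
    return parsed.scheme in {"http", "https"} and bool(parsed.netloc)
```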
```python
L = logging.getLogger(__name__)


_ERDDAP_FORMAT_RE = re.compile(
    r"\.(?P<fmt>csv|csvp|csv0|nc|nccf|nc4|cdf)(?P<tail>$|\?)",
)
```
@tanmayrajurkar can you explain this REGEX in words?
Sure.
This regular expression identifies dataset URLs on ERDDAP-style servers by looking for a known data format extension at the end of the path, optionally followed by a query string.
Explaining it further:
- `\.` matches the literal dot before the format.
- `(?P<fmt>csv|csvp|csv0|nc|nccf|nc4|cdf)` captures the dataset format, which can be csv, csvp, csv0, nc, nccf, nc4, or cdf.
- `(?P<tail>$|\?)` ensures the format appears either at the end of the URL or immediately before a `?`, which is typical for ERDDAP tabledap queries.

So it matches URLs like:
- `/dataset.csv`
- `/dataset.nc`
- `/dataset.csv?time,temperature`
The goal is not general URL validation, but lightweight detection of ERDDAP-style CSV/NetCDF endpoints so they can be routed to the correct loader.
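A quick illustration of that matching behavior (the URLs below are made up for the example, not taken from the PR):

```python
import re

_ERDDAP_FORMAT_RE = re.compile(
    r"\.(?P<fmt>csv|csvp|csv0|nc|nccf|nc4|cdf)(?P<tail>$|\?)",
)

for url in (
    "https://example.org/erddap/tabledap/dataset.csv",
    "https://example.org/erddap/tabledap/dataset.csv?time,temperature",
    "https://example.org/erddap/tabledap/dataset.html",  # not a recognized format
):
    match = _ERDDAP_FORMAT_RE.search(url)
    print(url, "->", match.group("fmt") if match else None)
# fmt groups printed: csv, csv, None
```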
Isn't a regex overkill for this task? Also, check what erddapy does. Try to read its code and not just AI-parse it. We can learn a lot by reading code from a real human.
@ocefpaf Agreed, the regex is not needed for this purpose. After reviewing the erddapy library, relying on it directly is the simpler and better solution. I will eliminate the regex-based mechanism altogether.
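For reference, relying on erddapy directly could look roughly like the sketch below; the server URL, dataset id, and variables are placeholders, not values from this PR:

```python
from erddapy import ERDDAP

# Placeholder server and dataset; any public ERDDAP tabledap dataset works the same way.
e = ERDDAP(server="https://example-erddap-server/erddap", protocol="tabledap")
e.dataset_id = "example_dataset_id"
e.variables = ["time", "latitude", "longitude", "sea_water_temperature"]

df = e.to_pandas()    # tabular data as a pandas DataFrame
# ds = e.to_xarray()  # or an xarray Dataset for NetCDF-style access
```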
Sure. While we do expect that data in an ERDDAP server was already QCed, that can happen and/or be part of someone's workflow. However, fetching or streaming data should be a function of an online app and not of the core QC library. What do you think?
```python
def _fetch_url_bytes(url: str, *, timeout: float = 30.0) -> bytes:
    """Fetch URL content as bytes with basic, user-friendly errors."""
    try:
        req = urllib.request.Request(  # noqa: S310
            url,
            headers={"User-Agent": "ioos_qc (python urllib)"},
        )
        with urllib.request.urlopen(req, timeout=timeout) as resp:  # noqa: S310
            return resp.read()
    except urllib.error.HTTPError as e:
        msg = f"HTTP error fetching {url!r}: {e.code} {getattr(e, 'reason', '')}".strip()
        raise ValueError(msg) from e
    except urllib.error.URLError as e:
        msg = f"URL error fetching {url!r}: {getattr(e, 'reason', e)}"
        raise ValueError(msg) from e
```
The cyclomatic complexity here is high for a simple fetch function. Also, requests is already in the dependency tree; one could make it a hard dependency and use that instead of bringing in raw urllib calls. Last but not least, erddapy is already a hard dependency and can fetch ERDDAP data, reducing a lot of the code complexity here.
@ocefpaf
That’s a fair point — agreed.
I kept this as urllib initially to avoid introducing new hard dependencies, but I agree the function is more complex than it should be for what it does. Given that requests is already in the dependency tree, switching to it would simplify both the code and error handling.
I also agree that leaning on erddapy is likely the cleanest approach for ERDDAP-specific access and would reduce this logic substantially. I’m happy to refactor in that direction or move this out of core entirely if that’s preferable.
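For reference, a possible requests-based version might look like the sketch below; this is illustrative only, not the final refactor:

```python
import requests


def _fetch_url_bytes(url: str, *, timeout: float = 30.0) -> bytes:
    """Sketch: fetch URL content as bytes via requests, surfacing a friendly error."""
    try:
        resp = requests.get(url, headers={"User-Agent": "ioos_qc"}, timeout=timeout)
        resp.raise_for_status()  # turn 4xx/5xx responses into exceptions
    except requests.RequestException as e:
        msg = f"Error fetching {url!r}: {e}"
        raise ValueError(msg) from e
    return resp.content
```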
> That’s a fair point — agreed.

It is very tiresome for a human to keep reading AI-generated comments. @tanmayrajurkar, can you use your own "voice" instead? It would be easier to understand your needs that way.
@ocefpaf Noted. Thanks for pointing that out. I will keep my answers more brief and written in my own way.
That’s a fair point — I agree that full fetching/streaming logic belongs in an application layer, not the core QC library. My intent here was only a very thin convenience to resolve a dataset reference (local path vs remote endpoint) before handing off to the existing stream classes, not to turn ioos_qc into an online client. If even that feels out of scope for core, I’m happy to move this logic out to an example/helper and keep ioos_qc strictly file/object-based.
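To make that intent concrete, here is a deliberately minimal sketch of the kind of thin routing being described; the function name is made up for illustration and only covers the CSV case (NetCDF endpoints would go through xarray instead):

```python
import pandas as pd

from ioos_qc.streams import PandasStream


def _stream_from_reference(ref: str) -> PandasStream:
    """Illustrative only: resolve a local path or remote ERDDAP CSV into a PandasStream."""
    # pandas accepts both local paths and http(s) URLs, so the helper stays thin.
    # Note: ERDDAP's plain .csv responses include a units row; the .csvp response
    # puts units in the header instead, which is friendlier for pandas.
    df = pd.read_csv(ref)
    return PandasStream(df)
```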
Refs #183
Summary
This PR adds support for running IOOS QC directly from public ERDDAP
dataset URLs, without changing any existing QC logic or behavior.
The change introduces a small helper that accepts either a local file
path or an ERDDAP URL (CSV or NetCDF) and routes the loaded data through
the existing stream classes.
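As a rough usage sketch (the helper name comes from this PR; the exact signature and the URL below are illustrative assumptions):

```python
from ioos_qc.streams import stream_from_path_or_erddap_url  # helper proposed in this PR

# Placeholder ERDDAP tabledap CSV endpoint; a local file path works the same way.
url = "https://example-erddap-server/erddap/tabledap/dataset.csvp?time,latitude,longitude"

stream = stream_from_path_or_erddap_url(url)
# The returned object is a regular PandasStream/XarrayStream, so the existing
# QC workflow applies, e.g. results = stream.run(existing_config).
```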
What Changed
- Adds a `stream_from_path_or_erddap_url()` helper to `ioos_qc.streams`
- Routes the loaded data through the existing `PandasStream` and `XarrayStream` classes

What Did NOT Change
- No existing QC logic or behavior is modified
Motivation
Many oceanographic datasets are accessed via ERDDAP. Requiring users to
manually download datasets before running QC adds unnecessary friction.
This change enables a more direct and exploratory QC workflow while
keeping the library API minimal and backward compatible.
Testing
- Replaced the mocked HTTP test with an integration test against a live NOAA NCEI ERDDAP dataset, exercising latitude and longitude QC end to end.
Feedback welcome, especially on API placement and naming.