Skip to content

Conversation

@tanmayrajurkar
Copy link

Refs #183

Summary

This PR adds support for running IOOS QC directly from public ERDDAP
dataset URLs, without changing any existing QC logic or behavior.

The change introduces a small helper that accepts either a local file
path or an ERDDAP URL (CSV or NetCDF) and routes the loaded data through
the existing stream classes.

What Changed

  • Added stream_from_path_or_erddap_url() helper to ioos_qc.streams
  • Supports:
    • ERDDAP CSV endpoints (with proper handling of the units row)
    • ERDDAP NetCDF endpoints (loaded via xarray, in-memory)
  • Reuses existing PandasStream and XarrayStream classes
  • Added unit tests with mocked HTTP responses
  • Added a brief documentation note describing ERDDAP URL support

What Did NOT Change

  • No changes to QC algorithms or test behavior
  • No UI or web-specific logic
  • No new heavy dependencies
  • Local file-based workflows remain unchanged

Motivation

Many oceanographic datasets are accessed via ERDDAP. Requiring users to
manually download datasets before running QC adds unnecessary friction.
This change enables a more direct and exploratory QC workflow while
keeping the library API minimal and backward compatible.

Testing

  • New tests mock HTTP requests (no network dependency)
  • Tests cover both CSV and NetCDF ERDDAP endpoints
  • Verified that ERDDAP CSV units rows are correctly skipped

Feedback welcome, especially on API placement and naming.

Replace mocked HTTP test with integration test against actual ERDDAP server.
Tests latitude and longitude QC on real dataset from NOAA NCEI ERDDAP.
Verifies end-to-end functionality with live data source.
@ocefpaf
Copy link
Member

ocefpaf commented Jan 6, 2026

@tanmayrajurkar are you using AI to generate this code? Can you comment of the real use cases for these changes? Like a real life example/notebook?

@tanmayrajurkar
Copy link
Author

@ocefpaf Thanks for asking.

On AI usage: I do use AI-assisted tools in my workflow (similar to
pair-programming), but all code in this PR was designed, implemented,
reviewed, and tested by me. I make sure I fully understand and validate
everything before submitting.

On real-world use cases: this change is motivated by practical marine
data workflows I’ve worked on (e.g., ocean current and temperature
visualization pipelines), where datasets are accessed directly via
remote CSV/NetCDF endpoints (ERDDAP-style). Requiring manual downloads
before running QC adds friction and hurts reproducibility.

Being able to point ioos_qc directly at an ERDDAP URL allows QC to run
earlier in exploratory and notebook-based workflows, using existing
streams and QC logic unchanged.

Happy to add a small example or notebook if that would be useful.

Comment on lines +31 to +36
def _is_http_url(value: object) -> bool:
"""Return True if value is an http(s) URL string."""
if not isinstance(value, str):
return False
parsed = urllib.parse.urlparse(value)
return parsed.scheme in {"http", "https"} and bool(parsed.netloc)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tanmayrajurkar in #184 (comment) you said:

I make sure I fully understand and validate everything before submitting.

In light of that can I ask you to review this function as if you did not authored it? Do you see anything that can be improved, fixed, streamlined?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ocefpaf
Yes-reviewing it as if I didn't write it:
The intent of _is_http_url is to be a very small explicit guard to distinguish local paths from remote resources.
It returns True only for explicit http/https schemes, and with a non-empty network location (netloc), so things like relative paths or Exclude URLs like: file://.
I've kept the code on purpose minimal and side-effect free since it's only used as a routing helper and not as a full URL validator. If you think and supporting additional schemes or edge cases would be preferable. I'm happy to accommodate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is an AI description, right? A good review would question why use an object for type checking if only str is allowed and enforced in the function?

Copy link
Author

@tanmayrajurkar tanmayrajurkar Jan 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ocefpaf Actually yes— the explanation was over-formal and not very clear. I will put a simple version of it in my own words below.
Its correct — there’s no strong reason to accept object here.
This should just take str, since anything else is rejected immediately.

L = logging.getLogger(__name__)

_ERDDAP_FORMAT_RE = re.compile(
r"\.(?P<fmt>csv|csvp|csv0|nc|nccf|nc4|cdf)(?P<tail>$|\?)",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tanmayrajurkar can you explain this REGEX in words?

Copy link
Author

@tanmayrajurkar tanmayrajurkar Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ocefpaf

Sure.

This regular expression pattern is for identifying the URL pattern for datasets on sites that use the ERDDAP server type and involves looking for known data format extension - towards the end of the path, optionally followed by a query string.
Explaining It Further:

  • The \. matches the literal dot before the format.
  • ((?P<fmt>.) captures the dataset format, which could be csv, csvp, csv0, nc, nccf, nc4, cdf
  • (?P<tail>$|\?) ensures the format appears either at the end of the URL or immediately before a ?, which is typical for ERDDAP tabledap queries.

So it matches URLs like:

  • ./dataset.csv
  • ./dataset.nc
  • ./dataset.csv?time,temperature

The goal is not general URL validation, but lightweight detection of ERDDAP-style CSV/NetCDF endpoints so they can be routed to the correct loader.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't a REGEX and overkill for this taks? Also, check what erddapy does. Try to read its code and not just AI-parse it. We can learn a lot by reading code from a real human.

Copy link
Author

@tanmayrajurkar tanmayrajurkar Jan 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ocefpaf Agreed – the regex is not needed for this purpose. On further review of the erddapy library, it is apparent that directly relying on the library is the simpler and better solution. I will eliminate the regex-based mechanisms altogether.

@ocefpaf
Copy link
Member

ocefpaf commented Jan 6, 2026

Requiring manual downloads
before running QC adds friction and hurts reproducibility.

Sure. While we do expect that data in an ERDDAP server was already QCed, that can happen and/or be part of someone workflow. However, the fetching data, or streaming, should be a function of an online app and not the core QC library. What do you think?

Comment on lines +57 to +71
def _fetch_url_bytes(url: str, *, timeout: float = 30.0) -> bytes:
"""Fetch URL content as bytes with basic, user-friendly errors."""
try:
req = urllib.request.Request( # noqa: S310
url,
headers={"User-Agent": "ioos_qc (python urllib)"},
)
with urllib.request.urlopen(req, timeout=timeout) as resp: # noqa: S310
return resp.read()
except urllib.error.HTTPError as e:
msg = f"HTTP error fetching {url!r}: {e.code} {getattr(e, 'reason', '')}".strip()
raise ValueError(msg) from e
except urllib.error.URLError as e:
msg = f"URL error fetching {url!r}: {getattr(e, 'reason', e)}"
raise ValueError(msg) from e
Copy link
Member

@ocefpaf ocefpaf Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cyclomatic complexity here is high for a simple fetch function. Also, requests is already in the dependency tree, one could make it a hard dependency and use that instead of bringing raw urllib calls. Last but not least, erddapy is already a hard dependency and can fetch ERDDAP data reducing a lot of the code complexity here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ocefpaf
That’s a fair point — agreed.

I kept this as urllib initially to avoid introducing new hard dependencies, but I agree the function is more complex than it should be for what it does. Given that requests is already in the dependency
tree, switching to it would simplify both the code and error handling.

I also agree that leaning on erddapy is likely the cleanest approach for ERDDAP-specific access and would reduce this logic substantially. I’m happy to refactor in that direction or move this out of core entirely if that’s preferable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That’s a fair point — agreed.

It is very tiresome for a human to keep reading AI generated comments. @tanmayrajurkar, can you use your own "voice" instead? It would be easier to understand your needs that way.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ocefpaf Noted. Thanks for pointing that out. I will keep my answers more brief and written in my own way.

@tanmayrajurkar
Copy link
Author

Requiring manual downloads
before running QC adds friction and hurts reproducibility.

Sure. While we do expect that data in an ERDDAP server was already QCed, that can happen and/or be part of someone workflow. However, the fetching data, or streaming, should be a function of an online app and not the core QC library. What do you think?

That’s a fair point — I agree that full fetching/streaming logic belongs in an application layer, not the core QC library.

My intent here was only a very thin convenience to resolve a dataset reference (local path vs remote endpoint) before handing off to the existing stream classes, not to turn ioos_qc into an online client.

If even that feels out of scope for core, I’m happy to move this logic out to an example/helper and keep ioos_qc strictly file/object-based.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants