
Scripted loading of data found below a URL #4267

Open
philrz opened this issue Dec 12, 2022 · 1 comment · Fixed by #5437 or #5476

Comments


philrz commented Dec 12, 2022

At the time this issue is being filed Zed is at commit 313c4d4.

We recently noticed a tweet where a user pointed to some Python (archived as azure.ipynb.txt.gz in case the Gist disappears) that downloads and preps for query a list of CSV files, all accessible under a single URL prefix. Given Zed's flexibility, we recognized that the language should be capable of doing the same with much less code, e.g.:

  1. Fetch the HTML at the top-level URL
  2. Massage it into a list of URLs of the data files to be loaded
  3. Feed those URL strings into an operator that does a GET on each URL
  4. Apply shaping as necessary
  5. Load the data into a pool
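For reference, the first couple of steps above are easy to sketch in plain Python (the language of the notebook linked in the tweet). This is only an illustration: the URL prefix and the index HTML below are made up, and step 1's actual network fetch is noted in a comment rather than performed.

```python
# Sketch of steps 1-2: fetch an HTML index page and massage it into a
# list of data-file URLs. BASE_URL and INDEX_HTML are hypothetical
# stand-ins for a real index page, used only for illustration.
from html.parser import HTMLParser

BASE_URL = "https://example.com/data/"  # hypothetical URL prefix

# In real use, step 1 would be:
#   INDEX_HTML = urllib.request.urlopen(BASE_URL).read().decode()
INDEX_HTML = """
<html><body>
<a href="2022-01.csv">2022-01.csv</a>
<a href="2022-02.csv">2022-02.csv</a>
<a href="about.html">about</a>
</body></html>
"""

class LinkExtractor(HTMLParser):
    """Collect hrefs of <a> tags that point at CSV files (step 2)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value.endswith(".csv"):
                    self.links.append(BASE_URL + value)

parser = LinkExtractor()
parser.feed(INDEX_HTML)
urls = parser.links
print(urls)
# ['https://example.com/data/2022-01.csv', 'https://example.com/data/2022-02.csv']
```

Steps 3-5 would then GET each entry of `urls`, shape the results, and load them into a pool, which is exactly the part the Zed language would collapse into a short query.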

Ultimately this is likely to require some enhancements along the lines of things we've already discussed, e.g., a load operator within the language. After a few quick minutes looking at the data from this specific tweet, I spotted a couple of other things:

  • Some of the file URLs return HTTP 404 when accessed, so error handling will need to be covered
  • The Python already includes tricks to skip header lines in the CSV files that otherwise break our reader, so we'd need to do the same
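Both wrinkles above can be sketched in a few lines of Python. The sample preamble and the "skip until a line containing the real header field" heuristic are assumptions for illustration, not the linked notebook's exact logic; a fetch failure (e.g. HTTP 404) would be handled the same way as a file with no recognizable header, by skipping it.

```python
# Hedged sketch: skip junk preamble lines before the real CSV header,
# and degrade gracefully (return no rows) when no header is found,
# analogous to skipping a URL that returns HTTP 404.
import csv
import io

def parse_csv_skipping_preamble(text, header_field):
    """Drop lines until one containing header_field, then parse as CSV."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if header_field in line:
            return list(csv.DictReader(io.StringIO("\n".join(lines[i:]))))
    return []  # no header found: skip this file entirely

# Hypothetical file with a non-CSV preamble before the header row.
sample = "Report generated 2022-12-12\n\nname,count\nconn,42\ndns,7\n"
rows = parse_csv_skipping_preamble(sample, "name")
print(rows)
# [{'name': 'conn', 'count': '42'}, {'name': 'dns', 'count': '7'}]
```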

philrz commented Nov 15, 2024

Recent changes in the linked PRs have gotten this close to working. It's not yet in a GA release, but it is available at the tip of main as part of the wider changes happening in the SuperDB transition.

For instance, the following starts from the list of filenames shown in the HTML when you browse to https://github.com/brimdata/zed-sample-data/tree/main/zeek-default. The HTML is parsed to isolate just the relative path of each file, then each one is retrieved and all the downloaded records are counted.

$ super -version
Version: v1.18.0-153-gf1213fa5

$ super -c '
from https://github.com/brimdata/zed-sample-data/tree/main/zeek-default format line
| grep("\"payload\"")
| replace(this, "</script>", "")
| regexp_replace(this, /.*script.*>/, "")
| yield parse_zson(this)
| over payload.tree.items
| from eval(f"https://github.com/brimdata/zed-sample-data/raw/refs/heads/main/{path}")
| count()'
1474104(uint64)

However, the original goal of actually loading the data isn't yet possible with the lake: even the query above is currently prevented from running via super db query.

$ super db query '
from https://github.com/brimdata/zed-sample-data/tree/main/zeek-default format line
| grep("\"payload\"")
| replace(this, "</script>", "")
| regexp_replace(this, /.*script.*>/, "")
| yield parse_zson(this)
| over payload.tree.items
| from eval(f"https://github.com/brimdata/zed-sample-data/raw/refs/heads/main/{path}")
| count()'
https://github.com/brimdata/zed-sample-data/raw/refs/heads/main/zeek-default/analyzer.log.gz: cannot open in a data lake environment

I'll continue to hold this issue open until that starts working.
