
Scripted loading of data found below a URL #4267

Open
philrz opened this issue Dec 12, 2022 · 1 comment · Fixed by #5437 or #5476

Comments


philrz commented Dec 12, 2022

At the time this issue is being filed Zed is at commit 313c4d4.

We recently noticed a tweet where a user pointed to some Python (archived as azure.ipynb.txt.gz in case the Gist disappears) that downloads and preps for query a list of CSV files, all accessible under a single URL prefix. Given Zed's flexibility, we recognized that the language should be capable of doing the same with much less code, e.g.:

  1. Fetch the HTML at the top-level URL
  2. Massage it into a list of URLs of the data files to be loaded
  3. Feed those URL strings into an operator that does a GET on each URL
  4. Apply shaping as necessary
  5. Load the data into a pool
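For reference, the first couple of steps above are easy to sketch in plain Python (the language of the notebook linked in the tweet). This is only an illustration: the URL prefix and the index HTML below are made up, and step 1's actual network fetch is noted in a comment rather than performed.

```python
# Sketch of steps 1-2: fetch an HTML index page and massage it into a
# list of data-file URLs. BASE_URL and INDEX_HTML are hypothetical
# stand-ins for a real index page, used only for illustration.
from html.parser import HTMLParser

BASE_URL = "https://example.com/data/"  # hypothetical URL prefix

# In real use, step 1 would be:
#   INDEX_HTML = urllib.request.urlopen(BASE_URL).read().decode()
INDEX_HTML = """
<html><body>
<a href="2022-01.csv">2022-01.csv</a>
<a href="2022-02.csv">2022-02.csv</a>
<a href="about.html">about</a>
</body></html>
"""

class LinkExtractor(HTMLParser):
    """Collect hrefs of <a> tags that point at CSV files (step 2)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value.endswith(".csv"):
                    self.links.append(BASE_URL + value)

parser = LinkExtractor()
parser.feed(INDEX_HTML)
urls = parser.links
print(urls)
# ['https://example.com/data/2022-01.csv', 'https://example.com/data/2022-02.csv']
```

Steps 3-5 would then GET each entry of `urls`, shape the results, and load them into a pool, which is exactly the part the Zed language would collapse into a short query.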

Ultimately this is likely to require some enhancements along the lines of things we've already discussed, e.g., a load operator within the language. After a few quick minutes looking at the data from this specific tweet, I spotted a couple of other things:

  • Some of the file URLs return HTTP 404 when accessed, so error handling will need to be covered
  • The Python already includes tricks to skip header lines in the CSV files that otherwise break our reader, so we'd need to do the same
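Both wrinkles above can be sketched in a few lines of Python. The sample preamble and the "skip until a line containing the real header field" heuristic are assumptions for illustration, not the linked notebook's exact logic; a fetch failure (e.g. HTTP 404) would be handled the same way as a file with no recognizable header, by skipping it.

```python
# Hedged sketch: skip junk preamble lines before the real CSV header,
# and degrade gracefully (return no rows) when no header is found,
# analogous to skipping a URL that returns HTTP 404.
import csv
import io

def parse_csv_skipping_preamble(text, header_field):
    """Drop lines until one containing header_field, then parse as CSV."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if header_field in line:
            return list(csv.DictReader(io.StringIO("\n".join(lines[i:]))))
    return []  # no header found: skip this file entirely

# Hypothetical file with a non-CSV preamble before the header row.
sample = "Report generated 2022-12-12\n\nname,count\nconn,42\ndns,7\n"
rows = parse_csv_skipping_preamble(sample, "name")
print(rows)
# [{'name': 'conn', 'count': '42'}, {'name': 'dns', 'count': '7'}]
```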

philrz commented Nov 15, 2024

Recent changes in the linked PRs have gotten this close to working. It's not yet in a GA release, but it is available at the tip of main as part of the wider changes happening in the SuperDB transition.

For instance, the following starts from the list of filenames shown in the HTML when you browse to https://github.com/brimdata/zed-sample-data/tree/main/zeek-default. The HTML is parsed to isolate just the relative path of each file, then each one is retrieved and all the downloaded records are counted.

$ super -version
Version: v1.18.0-153-gf1213fa5

$ super -c '
from https://github.com/brimdata/zed-sample-data/tree/main/zeek-default format line
| grep("\"payload\"")
| replace(this, "</script>", "")
| regexp_replace(this, /.*script.*>/, "")
| yield parse_zson(this)
| over payload.tree.items
| from eval(f"https://github.com/brimdata/zed-sample-data/raw/refs/heads/main/{path}")
| count()'
1474104(uint64)

However, the original goal of actually loading the data isn't yet possible with the lake: even the query above is currently prevented from running via super db query.

$ super db query '
from https://github.com/brimdata/zed-sample-data/tree/main/zeek-default format line
| grep("\"payload\"")
| replace(this, "</script>", "")
| regexp_replace(this, /.*script.*>/, "")
| yield parse_zson(this)
| over payload.tree.items
| from eval(f"https://github.com/brimdata/zed-sample-data/raw/refs/heads/main/{path}")
| count()'
https://github.com/brimdata/zed-sample-data/raw/refs/heads/main/zeek-default/analyzer.log.gz: cannot open in a data lake environment

I'll continue to hold this issue open until that starts working.
