Values derived from exec'ed command lines #5038

Open
philrz opened this issue Feb 16, 2024 · 0 comments

tl;dr

A community user asked whether a Zed program could execute a command-line script to obtain values for further use in that program. As described, it's in some ways a variation of what's captured in #4752, and indeed the request came from the same user who initiated that one.

Details

At the time this issue is being opened, Zed is at commit 4dc6236.

In the user's own words in a Slack thread:

Is it possible to run a map reduce operation with Zed? What I'd like to do is iterate through each row in a CSV and execute command line script to populate missing column(s)

Naively, one might try to lean on the shell with something like the following, here attempting to replace null values with a random number:

$ zq -version
Version: v1.14.0-2-g4dc62369

$ cat has_nulls.csv 
one,two
1,
,2

$ zq -Z "over this => ( value := value==null ? $(echo $RANDOM) : value | collect(this) | unflatten(this))" has_nulls.csv
{
    one: 1.,
    two: 15732
}
{
    one: 15732,
    two: 2.
}

Of course, this didn't work as hoped: we got the same "random" value twice because the shell expanded the $() only once, before zq was ever invoked. So this would indeed need to be new Zed functionality.
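For contrast, here is a minimal Python sketch of per-value evaluation done outside of Zed today. It is not Zed functionality. The `fill_nulls` helper and `RANDOM_CMD` are hypothetical names made up for illustration, and the sketch assumes the command prints exactly one value to stdout.

```python
import csv
import io
import subprocess
import sys


def fill_nulls(csv_text, argv):
    """Replace each empty CSV field with the stdout of a fresh run of argv."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for key, value in row.items():
            if value == "":
                # One subprocess per missing value, so each null gets its own
                # result instead of a single value substituted up front.
                result = subprocess.run(
                    argv, capture_output=True, text=True, check=True
                )
                row[key] = result.stdout.strip()
        rows.append(row)
    return rows


# Hypothetical stand-in for `echo $RANDOM`: prints one random integer.
RANDOM_CMD = [sys.executable, "-c", "import random; print(random.randrange(32768))"]

if __name__ == "__main__":
    for row in fill_nulls("one,two\n1,\n,2\n", RANDOM_CMD):
        print(row)
```

Because the command runs once per null rather than once per invocation, the two filled-in values are generated independently.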

Upon asking the user for more detail on their use case, they added:

The use case here is basically anytime I want a column populated by more complex logic. Right now the only way to do this is to parse the JSON or CSV file with Python, read it into memory, run the operation, and then output the file back to disk. It would be great to pass a JSON representation of the given row into a shell script and have the resulting output populate a field.
I built something to do this using Datasette. It's not great and pretty hacky, but articulates the idea a bit further: https://github.com/iloveitaly/datasette-enrichments-shell
I wrote up a blog post that explains my particular use case a bit more: https://mikebian.co/p=2400&preview=1&_ppp=66d1d5be15
Really, this is just a variant of the API use case (#4752). A shell script would be way more flexible and it could call the API and do whatever it needed to do.
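To make the request concrete, below is a rough Python sketch of the behavior the user describes: each row is serialized as JSON, handed to an external command on stdin, and the command's stdout becomes a new field. The names `enrich` and `UPPER_CMD` are hypothetical, and this is only an assumption about how such a feature might behave, not a proposed Zed design.

```python
import json
import subprocess
import sys


def enrich(rows, argv, field):
    """Run argv once per row with the row as JSON on stdin; store stdout in field."""
    enriched = []
    for row in rows:
        result = subprocess.run(
            argv,
            input=json.dumps(row),  # the command reads the row from stdin
            capture_output=True,
            text=True,
            check=True,  # raise if the command exits non-zero
        )
        enriched.append({**row, field: result.stdout.strip()})
    return enriched


# Hypothetical enrichment command: reads a JSON row, prints the upper-cased name.
UPPER_CMD = [
    sys.executable,
    "-c",
    "import json, sys; print(json.load(sys.stdin)['name'].upper())",
]

if __name__ == "__main__":
    print(enrich([{"name": "ada"}, {"name": "grace"}], UPPER_CMD, "shout"))
```

Any executable could take the place of `UPPER_CMD`, including a script that calls an API, which is why the user sees this as a superset of #4752.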

Design thoughts

We've already had some preliminary internal discussion about this idea and agree that it would be cool, but there are some details to be ironed out.

For instance, having Zed call out to the shell could be seen as a security problem, e.g., if an attacker posts "useful" Zed programs online with shell callouts hidden inside them that end up deleting or harvesting sensitive data from an unsuspecting user's system. So, if implemented, it would probably need to come with some guardrails to minimize this kind of exposure.

Even if a user opts into its use somehow, it seems like it would mostly (only?) be applicable to the zq use case. In a lake service setup the Zed code executes server side, and letting clients execute arbitrary shell commands on a remote server is something most environments would want to always disallow. So, guardrails again.

In terms of design principles, we generally try to minimize Zed enhancements that work in one context (e.g., zq) but aren't available in others (e.g., lake service), but given the use case it might be worth making an exception here. This brings yet another possibility to mind: a Zui user working in a "local lake" context might still want to opt into this functionality, since the "server side" callouts would happen on the same workstation from which they're acting as a client.
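As one illustration of what such guardrails could look like, here is a small Python sketch that combines an explicit opt-in with an allowlist of programs. Everything in it is hypothetical: the `ZED_ALLOW_EXEC` variable and the `check_exec_allowed` helper are invented for this example and do not correspond to any existing Zed setting or API.

```python
import os
import shlex

# Hypothetical opt-in switch; not a real Zed environment variable.
ALLOW_ENV = "ZED_ALLOW_EXEC"


def check_exec_allowed(command_line, allowlist, env=None):
    """Return the parsed argv if callouts are opted in and the program is allowlisted.

    Raises PermissionError otherwise.
    """
    env = os.environ if env is None else env
    if env.get(ALLOW_ENV) != "1":
        # Default is deny: a program found online cannot run commands
        # unless the user has explicitly enabled callouts.
        raise PermissionError("exec callouts are disabled (opt-in required)")
    argv = shlex.split(command_line)
    if not argv or argv[0] not in allowlist:
        raise PermissionError(f"program not allowlisted: {command_line!r}")
    return argv
```

A lake service could simply never set the opt-in, while zq or a "local lake" user could enable it along with a short list of trusted programs.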
