Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

collect() and union() skipping null values #5039

Open
philrz opened this issue Feb 16, 2024 · 1 comment
Open

collect() and union() skipping null values #5039

philrz opened this issue Feb 16, 2024 · 1 comment
Labels
bug Something isn't working community

Comments

@philrz
Copy link
Contributor

philrz commented Feb 16, 2024

Repro is with Zed commit 4dc6236.

While helping a community Slack user I bumped into the following surprising result:

$ zq -version
Version: v1.14.0-2-g4dc62369

$ echo '1 2 null 4' | zq -z 'collect(this)' -
[1,2,4]

I would have expected the null value to still be in the array created.

Presumably due to the same effect, here collect() is not even returning an array.

$ echo 'null null' | zq -z 'collect(this)' -
null

union() has similar behaviors.

$ echo '1 2 null 4' | zq -z 'union(this)' -
|[1,2,4]|

$ echo 'null null' | zq -z 'union(this)' -
null

If we decide the current behavior is intentional and something we intend to keep, I'd suggest we start mentioning it in the docs with an explanation.

@philrz philrz added bug Something isn't working community labels Feb 16, 2024
mattnibs added a commit that referenced this issue Feb 19, 2024
mattnibs added a commit that referenced this issue Feb 19, 2024
@philrz philrz changed the title collect() skipping null values collect() and union() skipping null values Feb 20, 2024
@philrz
Copy link
Contributor Author

philrz commented Feb 20, 2024

#5041 was opened to change the behavior so collect() would preserve these values, but then we discussed it as a team and have decided to hold off on merging anything yet. Some things that were touched on in the discussion:

  1. There was consensus that if we change this for collect() we should do the same for union() so they're consistent.

  2. One reason for hesitation on merging collect(): Allow null values #5041 is that this change is significant enough that it might break some programs, e.g., those of a particular community zync user that has some particularly large/complex Zed that relies heavily on collect() at times. If Zed were more mature it would probably be appropriate to do a careful/slow transition, e.g., maintain the current behavior with some kind of flag that becomes a new default, then surface a warning to users whose programs are relying on that behavior because we intend to swap the default in a future version, since this would give them time to understand and react to the change. Instead we discussed helping them to audit their programs for possible exposure and also giving a heads-up to the wider community via Slack before switching the behavior. But then we identified other behaviors with aggregate functions we might want to consider, so we concluded that we should gather up our thoughts on all those changes and then figure out a transition plan.

  3. As an example of "other behaviors", @mccanne cited avg() as an example, since here null values are currently treated as "not present", whereas if other aggregate functions like collect() and union() are changed to start preserving the null values, users might now expect a different behavior from avg() too, e.g., treating them as "present" and implying some meaningful value such as 0, which would change the result.

$ zq -version
Version: v1.14.0-2-g4dc62369

$ echo '1 3 5 7' | zq -z 'avg(this)' -
4.

$ echo '1 null 3 5 7' | zq -z 'avg(this)' -
4.

  1. @mattnibs pointed out that if we did toggle the behavior, it would be easy for users to get back the current behavior using the where clause, such as can be seen with his branch from collect(): Allow null values #5041:
$ zq -version
Version: v1.14.0-3-ga33df054

$ echo '1 2 null 4' | zq -z 'collect(this)' -
[1,2,null,4]([(int64,null)])

$ echo '1 2 null 4' | zq -z 'collect(this) where this != null' -
[1,2,4]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working community
Projects
None yet
Development

No branches or pull requests

1 participant