Processing Avro OCF #762
Hey @siredmar, yeah I would imagine this isn't possible without some crazy trickery. We could potentially support this as an input codec format, which would work with file-based inputs.
What do you think would be the best approach?
I think the codec approach is definitely the better option. The interface of a codec implementation consumes a raw stream of the file contents.
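For orientation, here is roughly the shape such a codec reader takes. This is a simplified sketch, not the actual Benthos API: the real types live in internal/codec/reader.go, and the names and signatures below are assumptions.

```go
// Sketch only: approximates the codec reader pattern discussed above.
// The real definitions are in internal/codec/reader.go and may differ.
package codec

import (
	"context"
	"io"
)

// Reader consumes a raw stream and yields decoded messages one at a time.
type Reader interface {
	Next(ctx context.Context) ([]byte, error) // next decoded message, or io.EOF
	Close(ctx context.Context) error
}

// ReaderConstructor builds a Reader around the raw stream handed over by a
// file-based input; an OCF codec would wrap this stream in an Avro decoder.
type ReaderConstructor func(path string, stream io.ReadCloser) (Reader, error)
```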
Hi! I have a use case where we have a large volume of OCF files to process, and this would be extremely useful. I looked at /internal/codec/reader.go, and from what I understand that's where the bulk of the work to add a codec for the file input would live. Is this correct?
By all means, all PRs are welcome! You're looking in the right place. Search, for example, for how one of the existing codecs is wired up. Not sure which Avro package you want to try, but LinkedIn's goavro is what's used for the schema_registry processor.
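For reference, a minimal sketch of reading an OCF file with goavro and re-encoding each record as JSON. The file name and error handling are illustrative, but NewOCFReader, Scan, Read and Codec().TextualFromNative are goavro v2's actual API:

```go
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/linkedin/goavro/v2"
)

func main() {
	// Open the container file; the schema lives in its header, so no
	// external schema is needed to decode the records.
	f, err := os.Open("example.ocf") // illustrative file name
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	ocfr, err := goavro.NewOCFReader(f)
	if err != nil {
		log.Fatal(err)
	}

	for ocfr.Scan() {
		// Read returns the record in goavro's native Go form
		// (map[string]interface{} for records).
		datum, err := ocfr.Read()
		if err != nil {
			log.Fatal(err)
		}
		// Re-encode the native datum using Avro's textual (JSON) encoding.
		jsonBytes, err := ocfr.Codec().TextualFromNative(nil, datum)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(string(jsonBytes))
	}
	if err := ocfr.Err(); err != nil {
		log.Fatal(err)
	}
}
```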
I think I have something that follows the pattern in the other codecs using goavro.TextualFromNative(). Trying to run make, but running into issues:

```
go: finding cloud.google.com/go/pubsub v1.17.1
```

I have Go 1.18 installed; do I need to downgrade to 1.16, or is the build failing for some other reason? Here's what I get when I manually run `go get github.com/Azure/azure-sdk-for-go/sdk/azcore`:

```
go get -x github.com/Azure/azure-sdk-for-go/sdk/azcore
get https://proxy.golang.org/github.com/%21azure/azure-sdk-for-go/sdk/azcore/@v/list
get https://proxy.golang.org/github.com/%21azure/azure-sdk-for-go/sdk/azcore/@v/list: 200 OK (0.476s)
build github.com/Azure/azure-sdk-for-go/sdk/azcore: cannot load github.com/Azure/azure-sdk-for-go/sdk/azcore: no Go source files
```
Not sure why you're getting that. I'm on Go 1.18 with my usual Go env vars set and the build works fine for me. If all else fails, try building it in a clean, isolated environment.
Apparently this is what happens when you upgrade your Windows Go install to 1.18 but your WSL Go install is still on 1.13 🤦. I only had make installed in WSL, but I rarely use it, so I didn't notice till now. I ran a test that reads an OCF and writes to a file output with the lines codec, and it's spitting out the data with some invisible UTF-8 characters, not JSON. I think I need to use a different codec method and json.Marshal; although that makes me wonder if there's a use case for reading binary Avro data and passing the Avro schema along as input metadata, so a processor can mutate it and use it for output. I should probably get this working first, though.
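If it helps, the "invisible characters" are what you'd expect from Avro's binary encoding: the codec method chosen decides the output format. A small comparison sketch, with the schema made up for illustration:

```go
package main

import (
	"fmt"
	"log"

	"github.com/linkedin/goavro/v2"
)

func main() {
	// Illustrative schema; any record schema shows the same contrast.
	codec, err := goavro.NewCodec(`{
		"type": "record", "name": "Event",
		"fields": [{"name": "id", "type": "long"}]
	}`)
	if err != nil {
		log.Fatal(err)
	}
	native := map[string]interface{}{"id": int64(42)}

	// Binary encoding: compact, but full of non-printable bytes.
	bin, err := codec.BinaryFromNative(nil, native)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("binary:  %q\n", bin)

	// Textual encoding: plain JSON, readable by downstream processors.
	txt, err := codec.TextualFromNative(nil, native)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("textual: %s\n", txt)
}
```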
I looked at the schema_registry processor implementation and realized that the avro-ocf input should do the same thing but from a file, using part.SetJSON().
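A compact sketch of that idea; setJSON here is a stand-in for Benthos's internal part.SetJSON(), whose exact signature may differ. The point is that goavro's native form can be attached as structured content directly, skipping a textual round-trip:

```go
package main

import "fmt"

// setJSON stands in for part.SetJSON() on Benthos's internal message type.
func setJSON(v interface{}) { fmt.Printf("structured part: %v\n", v) }

func main() {
	// goavro's Read() yields records as map[string]interface{} ("native"
	// form), which maps cleanly onto a structured (JSON) message part.
	datum := map[string]interface{}{
		"id":    int64(42),
		"inner": map[string]interface{}{"ts": "2022-01-01T00:00:00Z"},
	}
	setJSON(datum)
}
```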
I guess it can be when dealing with really complex and heavily nested structures.
I don't know anything about Avro OCF, but is that a meaningful operation then? Maybe such a file is not meant to be converted to JSON. Note that the only reason to convert it to JSON is to allow users to define transformations for it. If it's such a complex and large object, then the Benthos way of performing transformations might not be ideal.

OTOH, does it perhaps contain multiple records inside, like a really big JSON array? If this is the case, then maybe the new codec should behave like the lines codec and emit one message per record.

Still, 130MB expanding to 5GB sounds excessive to me. Maybe you bumped into some bug in the goavro library. Do you have a tool which can extract its schema somehow? If the output produced by goavro is complete nonsense, I'd recommend creating a simple self-contained app which reproduces the issue by calling goavro functions directly and then raising an issue upstream on the goavro repo. Also, it might be worth trying the same thing with https://github.com/hamba/avro to see if it behaves the same way.
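Something like the following is the kind of self-contained repro suggested above; it bypasses Benthos entirely, dumps the schema from the file header, and measures how much JSON goavro produces per file (file name illustrative):

```go
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/linkedin/goavro/v2"
)

func main() {
	f, err := os.Open("sample.ocf") // illustrative file name
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	ocfr, err := goavro.NewOCFReader(f)
	if err != nil {
		log.Fatal(err)
	}
	// The writer schema is recoverable from the OCF header via the codec.
	fmt.Println("schema:", ocfr.Codec().Schema())

	var records, jsonBytes int
	for ocfr.Scan() {
		datum, err := ocfr.Read()
		if err != nil {
			log.Fatal(err)
		}
		txt, err := ocfr.Codec().TextualFromNative(nil, datum)
		if err != nil {
			log.Fatal(err)
		}
		records++
		jsonBytes += len(txt)
	}
	if err := ocfr.Err(); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%d records, %d bytes of JSON\n", records, jsonBytes)
}
```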
So the use case is that we get logs shipped to us from a third party that don't partition well in BigQuery: the datetime field is nested rather than top-level. So far we've been dealing with this by loading into a temporary table and then appending `SELECT CAST(nested.field AS TIMESTAMP) event_time, * FROM temptable` to the production table. This incurs scanning costs just to ingest the logs into BigQuery.
Hi!
Am I right that it is currently not possible to process Avro OCF (Object Container Files, http://avro.apache.org/docs/current/spec.html#Object+Container+Files) with Benthos?
The main difference here is that the schema is embedded in the file header and can be used to decode the messages.
Cheers
Armin