Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

I would love a way to create a Writer without mutable borrow of SerializerConfig #14

Closed
droundy opened this issue Jun 12, 2024 · 4 comments · Fixed by #20
Closed

I would love a way to create a Writer without mutable borrow of SerializerConfig #14

droundy opened this issue Jun 12, 2024 · 4 comments · Fixed by #20

Comments

@droundy
Copy link
Contributor

droundy commented Jun 12, 2024

My use pattern involves creating a new Writer every minute (for a new file), and each writer remains open for two minutes so we can write to the "old" minute if some delayed messages come in.

I'm finding the lifetimes for this very challenging to manage with the pattern of requiring a &mut SerializerConfig. I can appreciate the value of reusing buffers when serializing a single datum, but creating the buffers once per minute would be no problem if Writer were to own its buffers rather than borrowing them.

I don't have a great idea for an elegant API change unless you were willing to go all the way and make Writer always own its buffers, which would be a breaking change.

@Ten0
Copy link
Owner

Ten0 commented Jun 12, 2024

Hi!
Thanks for opening this issue.
The need for a ~'static Writer is something I have considered, and imagined a design for. However I figured it would be built on top of the current API, (and probably outside of serde_avro_fast until that shows to be a somewhat common need).

Here are the key elements why I like the current design for the low-level API:

  • I want the most performant API to always be available, and I would also like for it to be the most natural.
  • The &mut SerializerConfig makes buffer reuse work very naturally and easily even across errors. It makes the most natural writing also the most performant. If the buffers were owned by the serializer that would be significantly more boilerplate-ish, in particular for single-datum functions, and in particular if handling errors (see the usage docs before 25384fa).

I agree that for the writer this is less relevant: instantiating a writer is not meant to happen too often, so buffer reuse across writer instanciations is less critical. (That being said I do like that it's available.)

Now since the Writer is built on top of the same SerializerState that takes in &mut SerializerConfig until very deep in the serialization module (which I don't think we want to change because it looks like that would significantly worsen the low-level API), and the object_container_file_encoding Writer contains the SerializerState, for the writer to not depend on SerializerConfig, it would have to hold the SerializerConfig in addition to the &mut SerializerConfig, so the design would be to have it be self-referential.

Now if we do this, because self-referential is otherwise very non-trivial to get sound, this means we probably would want to use the ouroboros crate for this.

So doing it in serde_avro_fast itself would have two downsides:

  1. It would add the ouroboros crate to our dependencies, which is by nature not going to feel like the cleanest, most stable or most reliable of dependencies (especially considering that there is still active work in the compiler around self-references support), and I also fear that it might be heavy (with the proc macro, the fact that it would need to always allocate even if using the outside config...)
  2. We'd probably open an API like Writer::with_inner_config(), which I fear may become more natural to use than its slighly better counterpart that takes a ref to the config (although with the main one being called new and this one being called with_inner_config it would probably be ok).

Feature-flagging it looks like it solves both these issues, but maybe at the cost of maintaining a part of the writer code twice with significant differences conditionally on whether the feature is enabled (due to the necessity of doing with_writer_mut(|writer| ...) with ouroboros, and also that would leave us with a dangling feature flag if we move out from ouroboros.

If a user needs to do this, it looks like they could implement it themselves using ouroboros:

#[ouroboros::self_referencing]
struct Writer<'schema> {
serializer_config: SerializerConfig<'schema>,
#[borrows(mut serializer_config)]
#[covariant]
writer: serde_avro_fast::object_container_file_encoding::Writer<'this, 'schema, Vec<u8>>,
}

(this seems reasonably non-boilerplate-ish as long as their writer has a stub value)

So overall it looks like the feature would make sense, it's probably ok to have, but I'm unclear on whether this should be a feature flag or not, and whether this is a common enough use-case that we want to do the implementation as part of serde_avro_fast instead of letting the user manage the self-referential part themselves.

What do you think?

@droundy
Copy link
Contributor Author

droundy commented Jun 12, 2024

Thanks for pointing me at ouroboros. I've been skeptical of that crate (and the other self-referential structure crates, but maybe that's a way for me to move forward with switching to serde_avro_fast, which I'd like to do.

In terms of future directions, I'd tend to suggest (for next breaking change) to somehow separate the buffer from the config, and pass the buffer as a separate argument rather than storing a reference internally. It sounds like a somewhat painful change, but in general holding a &mut within a struct is quite problematic, so if I were you I'd go far out of my way to avoid that.

Using ouroboros internally does not sound to me like a great plan, even if it is easy. Surely it wouldn't be that hard to at least make a non-exported API that takes a separate &Schema and &mut Buffers as inputs, so the essential code could be shared between both paths. And then maybe an ordinary Config that only holds configuration and not actual buffers, and could even be Copy.

@Ten0
Copy link
Owner

Ten0 commented Jun 12, 2024

I'm struggling to get a clear idea of the design you're suggesting.

I'd tend to suggest (for next breaking change) to somehow separate the buffer from the config, and pass the buffer as a separate argument rather than storing a reference internally [because] in general holding a &mut within a struct is quite problematic

Do you mean owned in the low level API (in SerializerState), as was done before 25384fa? If so, doesn't it look like that would significantly worsen the API around buffer reuse that was updated in that commit?

If the buffers were owned by the serializer that would be significantly more boilerplate-ish

takes a separate &Schema and &mut Buffers as inputs

Or do you mean mutably borrowed in the low level API (in SerializerState)? If so, I don't understand how that solves the issue we currently have with &mut SerializerConfig. It looks like we'd just have the same issue with &mut Buffers as we currently have with &mut SerializerConfig, only the user would have both SerializerConfig and Buffers to manage as part of the public API.
Specifically, how one writes this line:

serializer_state: SerializerState<'c, 's, Vec<u8>>,
if the buffers (or config) are meant to be stored owned in the Writer but SerializerConfig holds &mut Buffers/&mut SerializerConfig is the issue in both cases.

Alternatives I see are:

  • Rebuild the SerializerState at every call to Serialize (sad because that means either we got an additional deref level at every write (if we use &mut Vec<u8> as W) or we need to std::mem::swap() the full Vec<u8> every time, and swap it back at the end, probably using some PutBack kind of struct with a destructor so that it's put back no matter the way we exit => heavier and ugly
  • Put an enum in the SerializerState that says whether it owns the buffers or &mut the buffers -> I'm not sure which is best between the branching at every usage + extra struct size or having the enum variant for the owned version contain a Box here, either way feels a bit dirty in the SerializerState, but maybe that's the most practical approach.

@Ten0
Copy link
Owner

Ten0 commented Jun 14, 2024

Yeah the more I think about it the more I think the best option for this would be to have SerializerState contain

enum SerializerConfigRef<'c, 's> {
    Borrowed(&'c mut SerializerConfig<'s>),
    Owned(SerializerConfig<'s>), 
}

(possibly boxed in the owned case)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants