feat(enriching): add `memory` enrichment table #21348
base: master
This adds additional VRL functions for reading and storing data in caches that can be configured in global options. Caches can store any VRL value and are meant to hold data for shorter periods. Every entry is assigned a TTL (time-to-live) based on the cache configuration and is removed when that TTL expires.
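To make the TTL behaviour described above concrete, here is a minimal Rust sketch of a store where every entry shares one configured TTL. The `TtlStore` type and its methods are hypothetical names chosen for illustration and are not the PR's code; values are plain strings rather than VRL values, and the clock reading is passed in so the behaviour is easy to follow.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical in-memory store; every entry shares one configured TTL.
pub struct TtlStore {
    ttl: Duration,
    // Each value is kept together with the time it was stored.
    entries: HashMap<String, (String, Instant)>,
}

impl TtlStore {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Stores a value, stamping it with the caller-supplied clock reading.
    pub fn put(&mut self, key: &str, value: &str, now: Instant) {
        self.entries.insert(key.to_string(), (value.to_string(), now));
    }

    /// Returns the value only while its TTL has not elapsed.
    pub fn get(&self, key: &str, now: Instant) -> Option<&str> {
        let (value, stored_at) = self.entries.get(key)?;
        if now.saturating_duration_since(*stored_at) < self.ttl {
            Some(value.as_str())
        } else {
            None
        }
    }

    /// Drops every entry whose TTL has elapsed.
    pub fn remove_expired(&mut self, now: Instant) {
        let ttl = self.ttl;
        self.entries
            .retain(|_, (_, stored_at)| now.saturating_duration_since(*stored_at) < ttl);
    }
}
```

In this sketch an expired entry is hidden from `get` immediately, while the memory is only reclaimed when `remove_expired` runs.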
@jszwedko @pront This is a very rough draft of caching feature for VRL. The idea is to have separate storage for VRL values that can be used across different VRL runs. I have opted for a very simple interface, with just There are many things to consider for this:
I hope this feature does not break any of the promises VRL makes. Let me know your thoughts on this and whether you think this feature makes sense. If it does, let me know whether this interface fits VRL well, or whether I need to figure something else out.
(chiming in here since this is for some of the work that we need to have done, and esensar and I have talked about this concept offline) Cache deletions: would this just be writing to a key with an empty object to imply a deletion, or should there be an explicit cache_delete that can be called? Monitoring: One of the other things that is probably a requirement for this would be monitoring. Some ideas that could be discussed:
Concurrency: It seems that the interval for concurrency would be very very short if there is a lock on the object for only as long as it takes to read or write to the memory space. The user would just have to understand that cache data may change/vanish between a read and a write, or between two reads. For our purposes, the cache is to prevent multiple "heavy" events from happening. Each event predictably produces the same outcome for some reasonably long period of time, but they do change state eventually. (Our environment: DNS data.) Cache updates: having TTL be "visible" for cached objects within VRL is necessary, since I can envision a rather crude predictive cache update method that uses randomness to refresh a cache item, based on the TTL. Otherwise, if a highly-used cached object expires, then there may be many threads trying to update it at the same time which would be very wasteful and probably a heavy load that is not desirable (after all, a cached object implies that the cache is faster than some other method.) Better to apply predictive decline to the chances that an object will be refreshed. |
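The "predictive decline" idea above could be sketched roughly as follows. This is only one possible reading of the comment: the linear ramp, the `window` parameter, and both function names are assumptions made for illustration. The random draw is passed in by the caller so the sketch does not depend on any particular RNG crate.

```rust
use std::time::Duration;

/// Probability that a reader should refresh an entry early.
/// It is 0 while the entry is fresh and rises linearly to 1 as the remaining
/// TTL runs out. `window` is a hypothetical tuning knob: the final fraction
/// of the TTL (for example 0.2) during which early refreshes may start.
pub fn refresh_probability(remaining: Duration, ttl: Duration, window: f64) -> f64 {
    if remaining.is_zero() || ttl.is_zero() {
        return 1.0; // already expired: always refresh
    }
    if window <= 0.0 {
        return 0.0; // early refresh disabled
    }
    let remaining_fraction = (remaining.as_secs_f64() / ttl.as_secs_f64()).min(1.0);
    ((window - remaining_fraction) / window).clamp(0.0, 1.0)
}

/// `draw` must be a uniform random number in [0, 1) supplied by the caller.
pub fn should_refresh(remaining: Duration, ttl: Duration, window: f64, draw: f64) -> bool {
    draw < refresh_probability(remaining, ttl, window)
}
```

With many readers hitting the same key, only a small share of them would decide to refresh near the end of the TTL, instead of all of them at once after expiry.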
Regarding the cache deletions, I think explicit function is better ( When it comes to TTL updates, would it make sense for reads to also update TTL, because I guess the idea of TTL is to avoid storing data that is not needed and frequent reads would mean that the data is still used. |
This is a great point, we should create metrics for this. I would also argue that we should use those as proof of the perf gains from this optimization and add them to the PR description.
Sure - I have no opinion on the method for deletions, other than it needs to be possible somehow by events within VRL contexts.
I would strongly disagree here, unless this were a separate TTL. A cache entry has an expiration because it is "volatile" data that loses accuracy over time and needs to be refreshed at the end of the TTL regardless of how many times it has been used. Having both a "last read" TTL and an "expiry" TTL could be useful in some cases, but we cannot combine them into a single TTL that gets refreshed/reset upon read.
Right, I hadn't thought about that. It makes sense to have just a single TTL, but with an additional VRL function to read it, so that it can be controlled better.
Is it necessary or desirable to have an additional VRL function to read the TTL, or should it somehow happen at the exact moment of the "cache_get"? I think a separate function could create race conditions, since the cached object and its TTL would then not be fetched at the same timestamp. "cache_get(,[,ttl_variable])" might work, perhaps? Then ttl_variable would be an object set to the value of the TTL, and the user could define the name of that object. I'm not sure that's the best way to do it, but I think fetching the cached object AND setting the TTL variable at the same time is a good idea.
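One way to avoid the race described above is to return the value and its remaining TTL from a single lookup. The Rust sketch below illustrates that shape; `Entry` and `get_with_ttl` are hypothetical names, not functions proposed in this thread, and the read deliberately leaves the expiry untouched, in line with the earlier point that reads should not reset the TTL.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical cache entry with an absolute expiry time.
pub struct Entry {
    pub value: String,
    pub expires_at: Instant,
}

/// Returns the value and its remaining TTL from one lookup, so both are
/// observed at the same instant. Reading never extends `expires_at`.
pub fn get_with_ttl<'a>(
    entries: &'a HashMap<String, Entry>,
    key: &str,
    now: Instant,
) -> Option<(&'a str, Duration)> {
    let entry = entries.get(key)?;
    // `checked_duration_since` yields `None` once `now` is past the expiry.
    let remaining = entry.expires_at.checked_duration_since(now)?;
    if remaining.is_zero() {
        return None; // expiring exactly now counts as expired
    }
    Some((entry.value.as_str(), remaining))
}
```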
Hey, I did a quick pass as promised. A general comment here is that the proposed change semantically is more like a "global VRL state" vs a "VRL cache". @fuchsnj also made the following point:
So back to the original problem, I would expect a caching solution to be hidden from the users i.e. no new VRL functions. For example, imagine that the following line |
I think it can be both. From my view, the "cache" part comes from (I believe) the concept that objects in the store have a timer that can be applied, and there is a clearing process that occurs separately from any event processing that will delete objects in the store whose timers have gone below zero. True, an object which has an exceptionally long TTL which is longer than the expected runtime of the system (or perhaps which have a special high value that is treated as infinite) would therefore be treated as permanent items, and so become "state" instead of "cache."
I'm not understanding quite how this would work. Somehow you'd have to indicate that the "vrl_function_foo" action would look at the cache, instead of doing whatever that function does in the first place. This would mean either some sort of tagging (?) to turn on or off a "look at the cache for every function from here on" method, or would mean universally applying caching to all functions, which would perhaps be useful but possibly exceptionally wasteful in memory space for things that were not desired to be cached. Imagine where "vrl_function(1,'a')" where "a" was a random string of 50 characters that may or may not ever appear again, and the event stream is 100k per second, and the result of the function was 900 bytes, and those 900 bytes change every 2 hours. This is essentially what we're trying to solve. The proposed method using new functions would allow very specific values to be inserted into or looked up in the cache, allowing for very focused scope of memory use, and also giving more granular control over including/not including certain things on a per-event basis. Other than using new VRL functions, the only other way I could see this working with maximum transparency would be to use specific "magic" object namespace prefixes to indicate cached data, but that is not very clean though I suppose I haven't thought about it enough.
I'm not sure how a timed cache function would ever be expected to be reliably idempotent, since at some point the TTL will expire, the cache value will be removed and/or updated to a different result, and the result may differ between iterations of examination. The user would need to understand this, and make accommodations for no data appearing in the cache, so that the heavy or slow function would be then called, and the result (hopefully) stored in the cache for subsequent references over the next time window of TTL. The original intention of this "cache" is that items which are relatively slow to access and which may have different values across some window of time that may be faster accessed in a memory store that is more lightweight than the function that generates them. There is an implicit understanding in such timed cache models that the value stored in memory is "almost as good" as the computationally costly or slow function which is used to insert the object in the cache, but that over time that value diminishes until a threshold where the item is expunged or refreshed. Our use case requires TTL, because we sometimes will see items inserted in the cache which will only be accessed for a few times over a few minutes, and then never again. If we cannot have those items removed automatically after a time period (regardless of how many times they are used or not used) then this is effectively a catastrophic memory leak. They also need to be refreshed on occasion, as the data loses accuracy over time. |
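The usage pattern described above (look in the cache, fall back to the slow function on a miss or after expiry, then store the result) can be sketched in Rust as follows. `get_or_compute` and `slow_lookup` are hypothetical names, and a plain `HashMap` stands in for whatever store is actually used.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Returns the cached value while it is still valid; otherwise calls the
/// slow function and caches its result for `ttl`.
pub fn get_or_compute<F>(
    cache: &mut HashMap<String, (String, Instant)>,
    key: &str,
    ttl: Duration,
    now: Instant,
    slow_lookup: F,
) -> String
where
    F: FnOnce(&str) -> String,
{
    if let Some((value, expires_at)) = cache.get(key) {
        if now < *expires_at {
            return value.clone(); // hit: the slow function is skipped
        }
    }
    // Miss or expired entry: do the expensive work and remember the result.
    let value = slow_lookup(key);
    cache.insert(key.to_string(), (value.clone(), now + ttl));
    value
}
```

The caller has to accept that the returned value may be up to `ttl` old, which is the "almost as good" trade-off mentioned in the comment.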
Maybe we can think of a better name for it, I agree that VRL cache might be misleading.
But in this case, we are directly accessing the storage in Rust code, which is behind In general I think these 2 are addressing different problems. I think proper VRL caching as you described it is a bigger undertaking, because we would probably need to think about when that caching makes sense, since many of the VRL functions are very fast and would probably take a hit from cache lookups instead of getting a speedup, so we would either have to selectively apply it to some functions, or provide a way to configure it (although then that wouldn't be hidden from the users and would probably be confusing to configure). Does adding some kind of global state to VRL (optional, it would have no effect unless user specifically calls these functions) make sense for you? Does it break any of the promises VRL makes? There is still a lot of work to be done for this PR, so I would just like to know in advance. If something like this is not an option for VRL, we can think about other solutions for slow functions, something a bit more hidden from the user (but I think it would always have to provide at least some configuration options, to ensure users can control the size of the cache). |
If we want to pursue this idea of global state, we might benefit from an RFC. If there are other ideas that do not require such a big change we can probably avoid the RFC. In both cases, some perf stats will make a more compelling case.
Ideally we would build on enrichment tables, which are currently used as external state for VRL. Looping in cc @lukesteensen for future discussions.
This is a very interesting feature, but I'm wondering about the use-cases it is solving to make sure it is the best solution for those use-cases since it diverges from one of VRL's design principles that calls should be non-blocking. Could you describe the use-cases you have for this feature @esensar @johnhtodd ? I think that would help us identify if adding these functions to VRL is the correct approach or if a separate transform or enrichment table would be better suited.
Yes, as @pront mentioned, I believe that extending enrichment tables is the better path here. At a high-level, there are some important characteristics of VRL that we want to maintain:
Introducing shared mutable state between VRL invocations would complicate these quite a bit. Instead, I think it would be better to separate writes from reads by putting them in different components. This dramatically simplifies the data flow and makes it clearer that state is being shared. One way to do this would be to introduce a new component that is basically both a sink and an enrichment table. It would look similar to the |
Alright, that makes sense to me. The separate sink into an enrichment table could be feasible for this. That would mean a new kind of enrichment table, which would be stored in memory instead of files. I will try that and see if something like that would work instead of this. Thanks for taking the time to review this. |
Thinking about this a bit more, is this solution preferred just due to simplified data flow? I think we still have the same issue about sharing mutable state, it is just no longer from different VRL invocations, but different components. If I understood that solution correctly, something like this would be implemented:
Now, when it comes to writing to that table, some kind of a lock would still have to be utilized (or maybe there would be a way to do it lock-free, but I guess that would have some other limitations). Does this sound right @lukesteensen ? |
@esensar Yes, it sounds like you have the right idea. One small point of clarification (which you may already know) is that it would only be one new component, a new enrichment table type, which would behave in some ways like a sink (i.e. it would accept input events from other components), but wouldn't actually involve creating a new type of sink. It would involve some work in our topology code to support hooking up this enrichment table to other components, likely mirroring the logic for sinks. And yes, you're right that it is still fundamentally shared state. The difference is that now we have exactly one writer (the new enrichment table component) and all VRL components are purely readers, which maintains the property that VRL scripts can be run many times, in any order, concurrently, etc, without changing the results (i.e. they behave more like pure functions without side effects). This will make it easier to avoid unexpected behavior and high levels of contention on the shared state, and implement/maintain optimizations around how we schedule VRL scripts to be run. One potentially useful library for implementing this would be evmap, but I'm sure there are others as well. Constraining ourselves to a single writer makes the design compatible with some of these data structures that have desirable properties. |
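The single-writer, many-readers split described above can be expressed at the type level. The sketch below is an illustration only: it uses a standard-library `RwLock` rather than `evmap`, and `TableWriter`, `TableReader`, and `table` are hypothetical names. The point is that the write half cannot be cloned, while read handles can be handed out freely and expose no mutation.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

type Shared = Arc<RwLock<HashMap<String, String>>>;

/// Write half: not `Clone`, so only one component can mutate the table.
pub struct TableWriter {
    inner: Shared,
}

/// Read half: cheap to clone, and it exposes lookups only.
#[derive(Clone)]
pub struct TableReader {
    inner: Shared,
}

/// Creates one writer and one reader over the same table.
pub fn table() -> (TableWriter, TableReader) {
    let inner: Shared = Arc::new(RwLock::new(HashMap::new()));
    (TableWriter { inner: Arc::clone(&inner) }, TableReader { inner })
}

impl TableWriter {
    pub fn insert(&mut self, key: &str, value: &str) {
        self.inner
            .write()
            .expect("lock poisoned")
            .insert(key.to_string(), value.to_string());
    }
}

impl TableReader {
    pub fn get(&self, key: &str) -> Option<String> {
        self.inner.read().expect("lock poisoned").get(key).cloned()
    }
}
```

Because readers cannot change the table, running them many times, in any order, or concurrently cannot alter what other readers observe, apart from updates made by the one writer.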
Thank you. Alright, I was initially going to add a new sink, but that approach makes sense. Thanks for the |
This implementation is based on `evmap`, for "lock-free" reading and writing. There is still a lock when data is refreshed, but the refresh timing can be controlled so that interruptions are less frequent.
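The trade-off mentioned above, where writes only become visible at a controlled refresh point, can be illustrated without `evmap` itself. The following single-threaded Rust sketch mimics the idea with a pending write list and an explicit `flush`; `BufferedTable` and its methods are hypothetical and do not reflect the `evmap` API or the PR's implementation.

```rust
use std::collections::HashMap;

/// Toy model of deferred publication: writes are queued and only become
/// visible to readers when `flush` is called.
pub struct BufferedTable {
    published: HashMap<String, String>,
    pending: Vec<(String, String)>,
}

impl BufferedTable {
    pub fn new() -> Self {
        Self { published: HashMap::new(), pending: Vec::new() }
    }

    /// Queues a write without touching what readers currently see.
    pub fn write(&mut self, key: &str, value: &str) {
        self.pending.push((key.to_string(), value.to_string()));
    }

    /// Reads only from the published view.
    pub fn read(&self, key: &str) -> Option<&str> {
        self.published.get(key).map(String::as_str)
    }

    /// Applies all queued writes and returns how many were applied.
    pub fn flush(&mut self) -> usize {
        let applied = self.pending.len();
        for (key, value) in self.pending.drain(..) {
            self.published.insert(key, value);
        }
        applied
    }
}
```

Flushing less often means fewer interruptions for readers, at the cost of readers seeing new data later.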
This is looking pretty reasonable to me at a high level! I will probably need to defer to @pront for a more detailed review since I have not worked on Vector day-to-day in quite a while now. You have managed to run into quite a few limitations of the design, so apologies for that and thank you for your patience 😅
Re: `Value` being memory inefficient, if I remember correctly this has something to do with the overhead of `BTreeMap` for maps with very few keys. I believe `HashMap` was better but we did not want to randomize key order. There was some talk about switching to `IndexMap` which should be more efficient and still keep a deterministic (though still unsorted) key order, but I don't think we ever got around to testing that out. Something for @pront to think about 😄
I think also the fact that it is an enum contributes too - that is great for processing speed I think, but it takes up more space, since it takes up the size of the biggest variant, right? Either way, I think JSON serializing it is okay too and it is not really exposed outside of the component, so it can be easily changed up later on without breaking existing usages. |
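The point about enum size can be checked with `std::mem::size_of`. The sketch below uses a toy enum, not Vector's actual `Value` type, to show that every instance is at least as large as the biggest variant's payload, even when it holds a small variant such as a boolean.

```rust
use std::mem::size_of;

/// Toy stand-in for a dynamic value type; this is NOT Vector's `Value`.
#[allow(dead_code)]
pub enum ToyValue {
    Null,
    Boolean(bool),
    Integer(i64),
    Bytes(Vec<u8>),
}

/// Returns (size of the enum, size of `bool`, size of `Vec<u8>`) in bytes.
pub fn toy_value_sizes() -> (usize, usize, usize) {
    (size_of::<ToyValue>(), size_of::<bool>(), size_of::<Vec<u8>>())
}
```

A `ToyValue::Boolean(true)` therefore occupies as much space as a `ToyValue::Bytes(..)` handle, which is the overhead being discussed; the exact numbers depend on the target platform.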
Left some editorial suggestions!
Co-authored-by: May Lee <[email protected]>
Thank you! I have applied your suggestions.
First review pass. Looking good.
    }
}

pub fn as_sink(&self) -> Option<SinkOuter<T>> {
👍 Let's add a comment summarizing the limitations Luke mentioned above.
}
}
}
}
Let's add a basic "How it works" section, you can just copy/paste the config you used for testing and sample outputs.
memory
enrichment table
src/topology/running.rs
Outdated
@@ -419,15 +419,37 @@ impl RunningTopology {
         let remove_sink = diff
             .sinks
             .removed_and_changed()
-            .map(|key| (key, self.config.sink(key).unwrap().resources(key)));
+            .map(|key| (key, self.config.sink(key).unwrap().resources(key)))
Try to remove all new `.unwrap()`s from this file. I think in most cases we can just chain an empty collection (as you refactor this maybe some can be converted into errors). But we want to avoid new potential panics as much as possible.
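As a rough illustration of the suggestion above, the Rust sketch below replaces a panicking `.unwrap()` on a lookup with a fallback to an empty collection. The `resources_for` function and the map-of-vectors shape are hypothetical simplifications and are not the actual types in `src/topology/running.rs`.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for looking up the resources of each changed key.
/// A key that is missing from `sinks` yields an empty resource list instead
/// of panicking.
pub fn resources_for<'a>(
    sinks: &HashMap<String, Vec<String>>,
    keys: &[&'a str],
) -> Vec<(&'a str, Vec<String>)> {
    keys.iter()
        .map(|key| {
            // `unwrap_or_default` substitutes an empty Vec when the key is absent.
            let resources = sinks.get(*key).cloned().unwrap_or_default();
            (*key, resources)
        })
        .collect()
}
```

Where a missing entry indicates a real bug rather than a benign gap, returning an error instead would be the alternative mentioned in the comment.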
Co-authored-by: Pavlos Rontidis <[email protected]>
This adds a new type of `enrichment_table` - `memory`, which can also act as a sink. It can store any VRL value and is meant to hold data for shorter periods. Every entry is assigned a TTL (time-to-live) based on the table configuration and is removed when that TTL expires. Data is stored in the table by using it as a sink.