-
-
Notifications
You must be signed in to change notification settings - Fork 136
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cache record names to avoid hitting class loader #219
Cache record names to avoid hitting class loader #219
Conversation
@@ -24,6 +24,8 @@ | |||
|
|||
public abstract class AvroSchemaHelper | |||
{ | |||
private static final Map<String, String> SCHEMA_NAME_CACHE = new HashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some concerns about this.
Since it is static
it is per-ClassLoader, and ideally nothing in Jackson would use that. If at all possible this should be tied to something else, probably AvroFactory
(or AvroMapper
if practical).
Second, all caches should be bound (not with unlimited size). This is easy to solve by just using
com.fasterxml.jackson.databind.util.LRUMap
with whatever size settings (default, max size) makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also another significant problem is that access is not synchronized: since this will get called from multiple threads, and it is not read-only, access must be made thread-safe. LRUMap
uses ConcurrentHashMap
which solves that issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question from a Java non-expert: will the call Class.forName(nestedClassName)
(called in the method that resolves the name) use the default class loader? If so, no matter how we cache the names, it will still be broken.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that would use the default class loader of... well. It's probably the context class loader (alternative being class loader that loaded given class; usually these are the same but not always).
First of all, thank you for contributing this! I did not realize that this path could be used during actual serializations, results not cached (if I had, would have raised the concern). I have some concerns about implementation, added notes. I think those should be resolvable. I was also wondering about where to add this (2.11 vs 2.12). I suspect there is value in getting it in earlier (2.11.3 -- altho I really hope to get 2.12.0.rc1 out within 1 month now), and maybe there is low risk for issues. |
Not sure how to tie the cache to the I've tracked down all classes using these methods, and the good news is that only two classes use it directly: As this is a severe issue for us, I'm going to work on it. I just need some guidance as it does seem to fit in the current design. |
Hi @cowtowncoder, any comments? What about using a static |
@marcospassos If I remember correctly, JDK offerings for weak/soft reference handling is unfortunately weak. |
Ok, yes, this is very tricky. For a moment I thought much/most could be moved within I hope to look more into this but I concur, this is badly implemented at this point & tricky to uproot. I wish I realized exactly how nasty this is -- now I understand why performance consequences are disastrous: failing to find a class can be VERY VERY expensive, esp. on bigger classpaths. One thing that would be useful, if it was possible to obtain, would be to see specific path of calls, if profiler could give that information. There are quite a few callpaths to |
One other thought: I am beginning to think that the "solution" in #195 is wrong -- anything that has to do class existence lookups seems like a Wrong Solution to a problem. So I am not ruling out a possibility of actually reverting that change. |
@cowtowncoder Thank you again for the detailed review and support. I've pushed a commit that makes use of Please let me know if you need anything else before merging this. We are looking forward to seeing it included as part of the next release. |
@marcospassos thanks! I think I'll take this for 2.11, hoping things can be cleared up in future. :) |
@marcospassos I did some changes mostly just to isolate cache bit further and use separate cache key type (if optimization is needed going all in :-D ). I realize that testing of this part is difficult, but it probably would be good if you could at least check that performance improvement is as expected (my main concern is that I managed to do something silly to render caching not work). |
LGTM. We're going to monitor the IO metrics after the next release. I'll keep you posted. |
While I am in 100% agreement with this statement, Avro defined this behavior for matching nested classes in their reference implementation. :( It's a choice between not being compatible with nested classes when using an Apache 1.9 schema, or doing class existence lookups. ... and it looks like Apache also handled this problem by just caching the Schema -> resolved full name as well :/ |
The "Right" Solution would be to emit a |
case FIXED: | ||
String namespace = schema.getNamespace(); | ||
String name = schema.getName(); | ||
String key = namespace + "." + name; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
namespace
might be null
here (not when a schema is generated from a java class, but if using hand-crafted or schemas generated from avdl files)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just double checked, schema.getFullName()
should be suitable as a cache key here without any additional checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi there @baharclerode! Thank you for additional comments -- and yes, I realize that Avro's choice of "interesting" ways of dealing with stuff sort of forces our hand here.
Would be happy to merge a PR for removing helper class as key, use schema.getFullName()
if that should work?
One other related question: if we had a setting to select between "1.8-and-earlier" vs "1.9-and-later" modes, would it be possible to use different logic, avoiding Class lookup or at least alleviate it?
Or does the problem exist regardless of this, with 1.9 and above?
(I realize that right now lack of context makes it impossible to have such a setting, but I still hope we can make AvroSchemaHelper
become non-static and/or take some context object we control which could pass configuration, and ideally contain per-mapper (or per-schema or whatever) cache).
But it would introduce a non-standard attribute that can eventually break if Avro parser doesn't recognize it anymore, or am I missing something? |
It's a standard attribute used everywhere except record/fixed/enum schemas (maps, arrays, large numbers, etc.), for some reason. The Apache parser wouldn't recognize it at all on records, so it'd just ignore it and fall back to class lookups if you're using Apache to deserialize avro records. If Jackson emitted the property on every record/enum/fixed schema, you could maintain compatibility with deserializing all Apache 1.8 and 1.9 schemas, while optimizing the usecase where Jackson is being used both for schema generation and deserialization at the cost of bloating the schema size a bit. Every other combination would require fallbacks to class lookups (so we'd still need the cache). The benefit is that it creates a potential path forward (if we could get Apache to adopt the same behavior) of moving back to a world where the general case doesn't require class existence checks (they would become a special case when deserializing from a Apache 1.9 schema) |
In our case, we don't use schema generation, but I understand that it should work in the same way provided that we include the |
That would be the idea in such a hypothetical scenario. You could retroactively generate the property for any 1.8 schemas (or 1.9+ schemas if you're 100% sure there are no nested classes in use) and update the stored schemas for your data to add the property to bypass class existence checks, without breaking compatibility with anything. Which is still a big mess, but it's a workable mess for someone that has stored historical data they can't throw out. |
The other option is to look into adding some compatibility feature flags to the |
If I understand the idea correctly, I like it: additional metadata to avoid lookups in best case; but fallbacks for other cases. |
We use Jackson in our stream processing pipeline, and we noticed that the serialization process was consuming most of the CPU time. After some profiling, we tracked down the issue to the method which resolves the record's name. In the current implementation, for every record visited, the class loader checks if a class with the record's name exists, causing an IO bottleneck in high loads. By caching the result, we were able to boost our performance by 7000%.