Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache record names to avoid hitting class loader #219

Merged
merged 2 commits into from
Sep 19, 2020

Conversation

marcospassos
Copy link
Contributor

@marcospassos marcospassos commented Sep 11, 2020

We use Jackson in our stream processing pipeline, and we noticed that the serialization process was consuming most of the CPU time. After some profiling, we tracked down the issue to the method which resolves the record's name. In the current implementation, for every record visited, the class loader checks if a class with the record's name exists, causing an IO bottleneck in high loads. By caching the result, we were able to boost our performance by 7000%.

@@ -24,6 +24,8 @@

public abstract class AvroSchemaHelper
{
private static final Map<String, String> SCHEMA_NAME_CACHE = new HashMap<>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some concerns about this.

Since it is static it is per-ClassLoader, and ideally nothing in Jackson would use that. If at all possible this should be tied to something else, probably AvroFactory (or AvroMapper if practical).

Second, all caches should be bound (not with unlimited size). This is easy to solve by just using

com.fasterxml.jackson.databind.util.LRUMap

with whatever size settings (default, max size) makes sense.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also another significant problem is that access is not synchronized: since this will get called from multiple threads, and it is not read-only, access must be made thread-safe. LRUMap uses ConcurrentHashMap which solves that issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question from a Java non-expert: will the call Class.forName(nestedClassName) (called in the method that resolves the name) use the default class loader? If so, no matter how we cache the names, it will still be broken.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that would use the default class loader of... well. It's probably the context class loader (alternative being class loader that loaded given class; usually these are the same but not always).

@cowtowncoder
Copy link
Member

First of all, thank you for contributing this! I did not realize that this path could be used during actual serializations, results not cached (if I had, would have raised the concern).

I have some concerns about implementation, added notes. I think those should be resolvable.

I was also wondering about where to add this (2.11 vs 2.12). I suspect there is value in getting it in earlier (2.11.3 -- altho I really hope to get 2.12.0.rc1 out within 1 month now), and maybe there is low risk for issues.
So we can go with 2.11 branch as target.

@marcospassos
Copy link
Contributor Author

marcospassos commented Sep 11, 2020

Not sure how to tie the cache to the AvroFactory or AvroMapper. This method is transitively used across several classes, including value-objects not related to any of these instances. An example is AvroSchema, instantiated directly in most cases.

I've tracked down all classes using these methods, and the good news is that only two classes use it directly: AvroReaderFactory and AvroWriteContext.

As this is a severe issue for us, I'm going to work on it. I just need some guidance as it does seem to fit in the current design.

@marcospassos
Copy link
Contributor Author

Hi @cowtowncoder, any comments?

What about using a static WeakHashMap where the class loader and the name cache are the keys and values respectively?

@cowtowncoder
Copy link
Member

@marcospassos If I remember correctly, JDK offerings for weak/soft reference handling is unfortunately weak.
Let me see if I can find a proper place for cache. This is the problem with static utility and other things without context; they can not really have state, including caching.

@cowtowncoder
Copy link
Member

cowtowncoder commented Sep 19, 2020

Ok, yes, this is very tricky. For a moment I thought much/most could be moved within AvroSchema but unfortunately cursed Avro lib type Schema is used instead so can not really stick it in there -- fundamentally all we need is a one-time decision of whether to use one full name or another, so if we could wrap Schema in something, consistently (and hold on to references to that wrapper), no shared caching should be needed.

I hope to look more into this but I concur, this is badly implemented at this point & tricky to uproot. I wish I realized exactly how nasty this is -- now I understand why performance consequences are disastrous: failing to find a class can be VERY VERY expensive, esp. on bigger classpaths.

One thing that would be useful, if it was possible to obtain, would be to see specific path of calls, if profiler could give that information. There are quite a few callpaths to getTypeId(Schema) and getFullName(Schema) but I suspect they are not equally problematic.

@cowtowncoder
Copy link
Member

One other thought: I am beginning to think that the "solution" in #195 is wrong -- anything that has to do class existence lookups seems like a Wrong Solution to a problem. So I am not ruling out a possibility of actually reverting that change.
That would not resolve the issue wrt 2.11 so perhaps static caching is a lesser evil, as long as size is bounded and both key and value are Strings. There is the need to use data struct that works in multi-threaded environment and com.fasterxml.jackson.databind.util.LRUMap works well.

@marcospassos
Copy link
Contributor Author

@cowtowncoder Thank you again for the detailed review and support. I've pushed a commit that makes use of LRUMap instead of HashMap. I didn't include tests because I don't see how we could test it, so I left it out.

Please let me know if you need anything else before merging this. We are looking forward to seeing it included as part of the next release.

@cowtowncoder
Copy link
Member

@marcospassos thanks! I think I'll take this for 2.11, hoping things can be cleared up in future. :)

@cowtowncoder cowtowncoder merged commit 80b8a19 into FasterXML:2.11 Sep 19, 2020
@marcospassos marcospassos deleted the record-name-caching branch September 19, 2020 21:33
@cowtowncoder cowtowncoder added this to the 2.11.3 milestone Sep 19, 2020
cowtowncoder added a commit that referenced this pull request Sep 19, 2020
cowtowncoder added a commit that referenced this pull request Sep 19, 2020
@cowtowncoder
Copy link
Member

@marcospassos I did some changes mostly just to isolate cache bit further and use separate cache key type (if optimization is needed going all in :-D ). I realize that testing of this part is difficult, but it probably would be good if you could at least check that performance improvement is as expected (my main concern is that I managed to do something silly to render caching not work).

@marcospassos
Copy link
Contributor Author

LGTM. We're going to monitor the IO metrics after the next release. I'll keep you posted.

@baharclerode
Copy link
Contributor

anything that has to do class existence lookups seems like a Wrong Solution to a problem.

While I am in 100% agreement with this statement, Avro defined this behavior for matching nested classes in their reference implementation. :(

https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/specific/SpecificData.java#L245-L262

It's a choice between not being compatible with nested classes when using an Apache 1.9 schema, or doing class existence lookups.

... and it looks like Apache also handled this problem by just caching the Schema -> resolved full name as well :/

@baharclerode
Copy link
Contributor

The "Right" Solution would be to emit a java-class (AvroSchemaHelper.AVRO_SCHEMA_PROP_CLASS) schema property on record schemas, and check/use that before doing a class lookup (Or only emitting that for nested classes, and assuming if it doesn't exist and the namespace doesn't exist, then it's a normal name, avoiding a class lookup in all cases); If you're in an ecosystem where you're using Jackson to generate the schemas and deserialize the records, that would avoid the class lookups altogether.

case FIXED:
String namespace = schema.getNamespace();
String name = schema.getName();
String key = namespace + "." + name;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

namespace might be null here (not when a schema is generated from a java class, but if using hand-crafted or schemas generated from avdl files)

Copy link
Contributor

@baharclerode baharclerode Sep 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just double checked, schema.getFullName() should be suitable as a cache key here without any additional checks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi there @baharclerode! Thank you for additional comments -- and yes, I realize that Avro's choice of "interesting" ways of dealing with stuff sort of forces our hand here.
Would be happy to merge a PR for removing helper class as key, use schema.getFullName() if that should work?

One other related question: if we had a setting to select between "1.8-and-earlier" vs "1.9-and-later" modes, would it be possible to use different logic, avoiding Class lookup or at least alleviate it?
Or does the problem exist regardless of this, with 1.9 and above?
(I realize that right now lack of context makes it impossible to have such a setting, but I still hope we can make AvroSchemaHelper become non-static and/or take some context object we control which could pass configuration, and ideally contain per-mapper (or per-schema or whatever) cache).

@marcospassos
Copy link
Contributor Author

The "Right" Solution would be to emit a java-class (AvroSchemaHelper.AVRO_SCHEMA_PROP_CLASS) schema property on record schemas, and check/use that before doing a class lookup (Or only emitting that for nested classes, and assuming if it doesn't exist and the namespace doesn't exist, then it's a normal name, avoiding a class lookup in all cases); If you're in an ecosystem where you're using Jackson to generate the schemas and deserialize the records, that would avoid the class lookups altogether.

But it would introduce a non-standard attribute that can eventually break if Avro parser doesn't recognize it anymore, or am I missing something?

@baharclerode
Copy link
Contributor

baharclerode commented Sep 20, 2020

non-standard attribute

It's a standard attribute used everywhere except record/fixed/enum schemas (maps, arrays, large numbers, etc.), for some reason. The Apache parser wouldn't recognize it at all on records, so it'd just ignore it and fall back to class lookups if you're using Apache to deserialize avro records.

https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/specific/SpecificData.java#L73-L75

If Jackson emitted the property on every record/enum/fixed schema, you could maintain compatibility with deserializing all Apache 1.8 and 1.9 schemas, while optimizing the usecase where Jackson is being used both for schema generation and deserialization at the cost of bloating the schema size a bit. Every other combination would require fallbacks to class lookups (so we'd still need the cache).

The benefit is that it creates a potential path forward (if we could get Apache to adopt the same behavior) of moving back to a world where the general case doesn't require class existence checks (they would become a special case when deserializing from a Apache 1.9 schema)

@marcospassos
Copy link
Contributor Author

In our case, we don't use schema generation, but I understand that it should work in the same way provided that we include the java-class attribute in the Avro schema, right? If so, indeed, it does seem to be the right path.

@baharclerode
Copy link
Contributor

That would be the idea in such a hypothetical scenario. You could retroactively generate the property for any 1.8 schemas (or 1.9+ schemas if you're 100% sure there are no nested classes in use) and update the stored schemas for your data to add the property to bypass class existence checks, without breaking compatibility with anything. Which is still a big mess, but it's a workable mess for someone that has stored historical data they can't throw out.

@baharclerode
Copy link
Contributor

The other option is to look into adding some compatibility feature flags to the AvroMapper whereby you could disable 1.9+ support to avoid the class existence checks if you're not using 1.9+ schemas.

@cowtowncoder
Copy link
Member

If I understand the idea correctly, I like it: additional metadata to avoid lookups in best case; but fallbacks for other cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants