Use thread local state for high compressor #47
base: master
Conversation
Use thread local state for the high compressor. This reduces one memory allocation and free per compress call to a single allocation per thread.
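A minimal sketch of the idea at the native level, assuming GCC/Clang's `__thread` storage and the lz4 r-series state API (`LZ4_sizeofStateHC`, `LZ4_compressHC_withStateHC`); the helper name and lazy allocation are illustrative, not the exact code in this change:

```c
#include <stdlib.h>
#include "lz4hc.h"

/* One HC state per thread, allocated lazily on first use and reused by
 * every later compress call on that thread. Plain __thread has no
 * destructor, so the buffer is only reclaimed at process exit; a
 * pthread_key_create() destructor would be needed to free it when the
 * thread dies. */
static __thread void *hc_state = NULL;

/* dst must have room for LZ4_compressBound(src_size) bytes. */
int compress_hc_cached(const char *src, char *dst, int src_size)
{
    if (hc_state == NULL) {
        hc_state = malloc(LZ4_sizeofStateHC()); /* roughly 256 KB */
        if (hc_state == NULL)
            return 0; /* surface allocation failure as a failed compress */
    }
    return LZ4_compressHC_withStateHC(hc_state, src, dst, src_size);
}
```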
Rebase the pull request #46
I'm against merging this. I tried the same thing (though with a different caching approach) around a month ago and found it caused a huge performance penalty (around 3x slower). I used a global cache for the hash tables (linnaea/MineBackupDelta@5191b8be93c0d15dc64d537b6395bfdfafd79e1f) and later reverted it for the native compressors (linnaea/MineBackupDelta@a3344cf4937833a9f54968e7343f3eb56c842fbd). And the benchmark I did here showed that using ThreadLocal instead of the global cache would only improve performance by around 2%, so the cache itself should not be the bottleneck. Or maybe I just got everything wrong.
I actually don't have concrete data to back this. It started with another part of my code that is entirely C++ and has nothing to do with Java. That code calls compress many, many times to compress a huge number of pages, which means many memory allocation calls, and the compression happens in many threads concurrently. Under such a scenario, the lock contention inside libc's allocator imposes high overhead. Using jemalloc, on the other hand, has less lock contention, but it makes a lot of madvise calls to return memory to the operating system. Either way, the system time is far too high, almost as much as the user time, which is unacceptable for me. So I switched to the withState version to avoid so much concurrent dynamic memory allocation, and the CPU utilization dropped to a much lower level.

If you only consider workloads that aren't highly concurrent and don't make many compress calls, there is no need for this change. Still, it looks strange that such a simple change added so much of a performance hit. As the lz4 code shows, LZ4_compressHC_withStateHC definitely does less work.
The only performance hit I can imagine is the initialization cost of the thread-local state. If the test only runs compress once, or once per thread, then the initialization cost is not amortized and could make the result look pretty bad. Or there could be other reasons that aren't clear yet. I'm pretty busy these days; I'll come back and dig into it when I get some spare time.
I guess a trade-off would be to expose a compression "state" so that consumers could decide to allocate on demand (as today) or to cache it, depending on their needs?
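In rough C terms, the shape would be something like the following, where the state is owned by the caller and the library imposes no allocation policy; all names here are illustrative, not an actual lz4-java API:

```c
#include <stdlib.h>
#include "lz4hc.h"

/* The library only uses whatever state the caller hands it. */
int compress_hc(void *state, const char *src, char *dst, int src_size)
{
    return LZ4_compressHC_withStateHC(state, src, dst, src_size);
}

/* Consumer A: allocate on demand, exactly like today. */
int compress_once(const char *src, char *dst, int src_size)
{
    void *state = malloc(LZ4_sizeofStateHC());
    if (state == NULL)
        return 0;
    int written = compress_hc(state, src, dst, src_size);
    free(state);
    return written;
}

/* Consumer B: allocate once and reuse across calls. */
static void *cached_state; /* e.g. one per thread or per compressor */

int compress_cached(const char *src, char *dst, int src_size)
{
    if (cached_state == NULL)
        cached_state = malloc(LZ4_sizeofStateHC());
    if (cached_state == NULL)
        return 0;
    return compress_hc(cached_state, src, dst, src_size);
}
```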
I guess it's better not to do anything fancy with the JNI implementation. I tried using the stack for the state, and the performance gain isn't noticeable (somewhere between 0.1% and 0.3%).
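Presumably something along these lines, with the state living in the function's own stack frame; note that LZ4_sizeofStateHC() is around 256 KB, so this assumes a generously sized thread stack (a C99 VLA is used here purely for illustration):

```c
#include <stdint.h>
#include "lz4hc.h"

int compress_hc_stack_state(const char *src, char *dst, int src_size)
{
    /* ~256 KB in the stack frame; uint64_t elements keep the buffer
     * 8-byte aligned for the HC state. */
    uint64_t state[(LZ4_sizeofStateHC() + 7) / 8];
    return LZ4_compressHC_withStateHC(state, src, dst, src_size);
}
```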
LZ4 is actually better than zlib for this scenario, as it only allocates memory once and the memory it allocates is smaller than zlib's, which makes the overhead lower. Anyway, the case only gets bad when the memory allocator is under high lock contention; with a single thread, or with less contention, the performance hit is acceptable.
My guess is that the performance bottleneck of the JNI implementation isn't the allocation of memory for the state, but the JNI interface and the JVM. For the Java implementations, having a cache for the state can improve the performance significantly in most cases. |
Yeah, you have a point. I just took a quick look at the HotSpot code, and GetPrimitiveArrayCritical seems to be very heavy: it acquires the GC lock before doing the actual work. So if there is contention, the JNI part alone would already suffer even more serious contention. It's probably not worth fixing this now.

```cpp
JNI_ENTRY(void*, jni_GetPrimitiveArrayCritical(JNIEnv *env, jarray array, jboolean *isCopy))
  JNIWrapper("GetPrimitiveArrayCritical");
  DTRACE_PROBE3(hotspot_jni, GetPrimitiveArrayCritical__entry, env, array, isCopy);
  GC_locker::lock_critical(thread);
  if (isCopy != NULL) {
    *isCopy = JNI_FALSE;
  }
  oop a = JNIHandles::resolve_non_null(array);
  assert(a->is_array(), "just checking");
  BasicType type;
  if (a->is_objArray()) {
    type = T_OBJECT;
  } else {
    type = typeArrayKlass::cast(a->klass())->element_type();
  }
  void* ret = arrayOop(a)->base(type);
  DTRACE_PROBE1(hotspot_jni, GetPrimitiveArrayCritical__return, ret);
  return ret;
JNI_END
```
The size of the state is roughly 256 KB per thread. Since the number of threads should normally not be insanely large, using a thread local seems acceptable to me. But if other people think it is too large, it could be made configurable so users can choose between long-term memory footprint and performance. Or the interface could be extended to either drop the singleton instance and keep the state in each compressor instance, or let users pass in a state they manage themselves when calling compress; a sketch of the per-instance variant follows.
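For the "state in the compressor instance" alternative, a rough sketch of the shape at the native level (in lz4-java the owner would be the Java compressor object rather than a struct; all names here are hypothetical):

```c
#include <stdlib.h>
#include "lz4hc.h"

/* Hypothetical non-singleton compressor: the state is allocated once
 * when the instance is created and reused for every compress call. */
typedef struct {
    void *hc_state;
} hc_compressor;

hc_compressor *hc_compressor_new(void)
{
    hc_compressor *c = malloc(sizeof *c);
    if (c == NULL)
        return NULL;
    c->hc_state = malloc(LZ4_sizeofStateHC());
    if (c->hc_state == NULL) {
        free(c);
        return NULL;
    }
    return c;
}

int hc_compressor_compress(hc_compressor *c,
                           const char *src, char *dst, int src_size)
{
    return LZ4_compressHC_withStateHC(c->hc_state, src, dst, src_size);
}

void hc_compressor_free(hc_compressor *c)
{
    free(c->hc_state);
    free(c);
}
```

Compared with the stateless singleton, an instance-owned state means each instance can serve only one compress call at a time, so concurrent callers would need one instance per thread anyway; that is essentially the trade-off described above.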