Use thread local state for high compressor #47

Open · wants to merge 1 commit into master
Conversation

@sunxiaoguang (Contributor)

Use thread-local state for the high compressor. This reduces one memory allocation and free per compress call to a single allocation per thread.

The state is roughly 256 KB per thread. Since the number of threads should normally not be insanely large, using thread-local storage seems acceptable to me. But if others think it is too large, it could be made configurable to choose between long-term memory footprint and performance. Alternatively, the interface could be extended to either drop the singleton instance and keep the state in the compressor instance, or let users pass in state they manage themselves when calling compress.
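A minimal sketch of the thread-local idea on the Java side (HCState and compressWithState are hypothetical stand-ins for the native state and the *_withState entry point, not the actual lz4-java API):

// Illustrative sketch only; HCState and compressWithState are hypothetical
// stand-ins for the native state and the *_withState native entry point.
public final class LZ4HCJNICompressor {
    static final class HCState { /* wraps the ~256 KB native LZ4HC_Data_Structure */ }

    // One state per thread, allocated lazily on first use and reused
    // across compress calls instead of a malloc/free pair per call.
    private static final ThreadLocal<HCState> STATE =
            ThreadLocal.withInitial(HCState::new);

    private static native int compressWithState(HCState state,
            byte[] src, int srcOff, int srcLen,
            byte[] dest, int destOff, int maxDestLen);

    public int compress(byte[] src, int srcOff, int srcLen,
                        byte[] dest, int destOff, int maxDestLen) {
        return compressWithState(STATE.get(), src, srcOff, srcLen,
                                 dest, destOff, maxDestLen);
    }
}

The trade-off is the one described above: each live thread keeps roughly 256 KB pinned until it exits.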

@sunxiaoguang (Contributor, Author)

This is a rebase of pull request #46.

@linnaea (Contributor) commented Aug 19, 2014

I'm against merging this.

I tried the same thing (though with a different caching approach) around a month ago and found it caused a huge performance penalty (roughly 3x slower).

I used a global cache for those hash tables (linnaea/MineBackupDelta@5191b8be93c0d15dc64d537b6395bfdfafd79e1f) and later reverted it for the native compressors (linnaea/MineBackupDelta@a3344cf4937833a9f54968e7343f3eb56c842fbd).

The benchmark I ran here showed that using ThreadLocal instead of the global cache would only improve performance by around 2%, so the cache itself should not be the bottleneck; or maybe I just got everything wrong.

@sunxiaoguang (Contributor, Author)

I actually don't have concrete data to back this. It started with another part of my code, which is entirely C++ and has nothing to do with Java. That code calls compress many times to compress a huge number of pages, which results in many memory allocation calls, and the compression happens in many threads concurrently. In that scenario, lock contention inside libc's allocator imposes high overhead. Using jemalloc, on the other hand, has less lock contention, but makes a lot of madvise calls to release memory back to the operating system. Either way, the system time was far too high, almost equal to the user time, which was unacceptable for me. So I switched to the withState version to avoid so much concurrent dynamic memory allocation, and CPU utilization dropped to a much lower level. If you only consider workloads that are not highly concurrent and do not make many compress calls, there is no need for this change.

Anyway, it still seems strange that this simple change caused such a large performance hit. As the lz4 code below shows, LZ4_compressHC_withStateHC definitely does less work than the allocating path.


int LZ4_compressHC2_withStateHC (void* state, const char* source, char* dest, int inputSize, int compressionLevel)
{
    if (((size_t)(state)&(sizeof(void*)-1)) != 0) return 0;   /* Error : state is not aligned for pointers (32 or 64 bits) */
    LZ4_initHC ((LZ4HC_Data_Structure*)state, (const BYTE*)source);   /* reuses caller-provided state; no allocation */
    return LZ4HC_compress_generic (state, source, dest, inputSize, 0, compressionLevel, noLimit);
}

void* LZ4_createHC (const char* inputBuffer)
{
    void* hc4 = ALLOCATOR(sizeof(LZ4HC_Data_Structure));   /* heap allocation */
    LZ4_initHC ((LZ4HC_Data_Structure*)hc4, (const BYTE*)inputBuffer);
    return hc4;
}

int LZ4_compressHC2_limitedOutput(const char* source, char* dest, int inputSize, int maxOutputSize, int compressionLevel)
{
    void* ctx = LZ4_createHC(source);   /* allocates a fresh state on every call */
    int result;
    if (ctx==NULL) return 0;
    result = LZ4HC_compress_generic (ctx, source, dest, inputSize, maxOutputSize, compressionLevel, limitedOutput);
    LZ4_freeHC(ctx);   /* and frees it again */
    return result;
}

The only performance hit I can imagine is the initialization cost of the thread-local state. If the test only runs compress once, or once per thread, then the initialization cost is not amortized and could make the result look pretty bad. Or there could be other reasons that are not yet clear. I'm pretty busy these days; I will come back and dig into it when I get some spare time.

@jpountz (Collaborator) commented Sep 25, 2014

I guess a trade-off would be to expose a compression "state" so that consumers could decide to allocate on demand (as today) or to cache it, depending on their needs?
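For illustration, such an API could look roughly like this (all names hypothetical, not the actual lz4-java interface); the caching policy then becomes the caller's decision:

// Hypothetical sketch of an explicit-state API.
LZ4HCState state = compressor.newState();   // ~256 KB, reusable across calls
// The caller chooses the policy: pool it, keep it in a ThreadLocal,
// or allocate a fresh one per call (today's behaviour).
int written = compressor.compress(state, src, 0, src.length,
                                  dest, 0, dest.length);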

@linnaea (Contributor) commented Sep 26, 2014

I guess it's better not to do anything fancy with the JNI implementation. I tried using the stack for the state, and the performance gain isn't noticeable (somewhere between 0.1% and 0.3%).

@sunxiaoguang (Contributor, Author)

LZ4 is actually better than zlib for this scenario, as it only allocates memory once and the memory it allocates is smaller than zlib's, which makes the overhead lower. Anyway, the situation only gets bad when the memory allocator is under high lock contention; in single-threaded or lightly contended cases the performance hit is acceptable.

@linnaea (Contributor) commented Sep 26, 2014

My guess is that the performance bottleneck of the JNI implementation isn't the allocation of memory for the state, but the JNI interface and the JVM itself. For the Java implementations, having a cache for the state can improve performance significantly in most cases.

@sunxiaoguang (Contributor, Author)

Yeah, you have a point. I just took a quick look at the HotSpot code, and GetPrimitiveArrayCritical seems to be very heavy: it acquires the GC lock before doing the actual work. So if there were contention, the contention in the JNI part alone would already be more serious. It probably isn't worth fixing this now.

JNI_ENTRY(void*, jni_GetPrimitiveArrayCritical(JNIEnv *env, jarray array, jboolean *isCopy))
  JNIWrapper("GetPrimitiveArrayCritical");
  DTRACE_PROBE3(hotspot_jni, GetPrimitiveArrayCritical__entry, env, array, isCopy);
  GC_locker::lock_critical(thread);   // takes the GC lock before touching the array
  if (isCopy != NULL) {
    *isCopy = JNI_FALSE;
  }
  oop a = JNIHandles::resolve_non_null(array);
  assert(a->is_array(), "just checking");
  BasicType type;
  if (a->is_objArray()) {
    type = T_OBJECT;
  } else {
    type = typeArrayKlass::cast(a->klass())->element_type();
  }
  void* ret = arrayOop(a)->base(type);
  DTRACE_PROBE1(hotspot_jni, GetPrimitiveArrayCritical__return, ret);
  return ret;
JNI_END
