Use thread local state for high compressor #47

Open · wants to merge 1 commit into master
Conversation

@sunxiaoguang (Contributor)

Use thread-local state for the high compressor. This reduces one memory allocation and free per compress call to a single allocation per thread.

The state is roughly 256 KB per thread. Since the number of threads should normally not be insanely large, using thread-local storage seems acceptable to me. But if others think it is too large, it could be made configurable to choose between long-term memory footprint and performance. Alternatively, the interface could be extended to either drop the singleton instance and keep the state in the compressor instance, or let users pass in state they manage themselves when calling compress.
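A minimal sketch of the thread-local idea on the Java side (HCState and compressWithState are hypothetical stand-ins for the native state and the *_withState entry point, not the actual lz4-java API):

// Illustrative sketch only; HCState and compressWithState are hypothetical
// stand-ins for the native state and the *_withState native entry point.
public final class LZ4HCJNICompressor {
    static final class HCState { /* wraps the ~256 KB native LZ4HC_Data_Structure */ }

    // One state per thread, allocated lazily on first use and reused
    // across compress calls instead of a malloc/free pair per call.
    private static final ThreadLocal<HCState> STATE =
            ThreadLocal.withInitial(HCState::new);

    private static native int compressWithState(HCState state,
            byte[] src, int srcOff, int srcLen,
            byte[] dest, int destOff, int maxDestLen);

    public int compress(byte[] src, int srcOff, int srcLen,
                        byte[] dest, int destOff, int maxDestLen) {
        return compressWithState(STATE.get(), src, srcOff, srcLen,
                                 dest, destOff, maxDestLen);
    }
}

The trade-off is the one described above: each live thread keeps roughly 256 KB pinned until it exits.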

@sunxiaoguang (Contributor, Author)

This is a rebase of pull request #46.

@linnaea (Contributor) commented Aug 19, 2014

I'm against merging this.

I tried the same thing (though with a different caching approach) around a month ago and found it caused a huge performance penalty (roughly 3x slower).

I used a global cache for those hash tables (linnaea/MineBackupDelta@5191b8be93c0d15dc64d537b6395bfdfafd79e1f) and later reverted it for the native compressors (linnaea/MineBackupDelta@a3344cf4937833a9f54968e7343f3eb56c842fbd).

The benchmark I ran here showed that using ThreadLocal instead of the global cache would only improve performance by around 2%, so the cache itself should not be the bottleneck; or maybe I just got everything wrong.

@sunxiaoguang (Contributor, Author)

I actually don't have concrete data to back this. It started with another part of my code, which is entirely C++ and has nothing to do with Java. That code calls compress many times to compress a huge number of pages, which results in many memory allocation calls, and the compression happens in many threads concurrently. In that scenario, lock contention inside libc's allocator imposes high overhead. Using jemalloc, on the other hand, has less lock contention, but makes a lot of madvise calls to release memory back to the operating system. Either way, the system time was far too high, almost equal to the user time, which was unacceptable for me. So I switched to the withState version to avoid so much concurrent dynamic memory allocation, and CPU utilization dropped to a much lower level. If you only consider workloads that are not highly concurrent and do not make many compress calls, there is no need for this change.

Anyway, it still seems strange that this simple change caused such a large performance hit. As the lz4 code below shows, LZ4_compressHC_withStateHC definitely does less work than the allocating path.


int LZ4_compressHC2_withStateHC (void* state, const char* source, char* dest, int inputSize, int compressionLevel)
{
    if (((size_t)(state)&(sizeof(void*)-1)) != 0) return 0;   /* Error : state is not aligned for pointers (32 or 64 bits) */
    LZ4_initHC ((LZ4HC_Data_Structure*)state, (const BYTE*)source);   /* reuses caller-provided state; no allocation */
    return LZ4HC_compress_generic (state, source, dest, inputSize, 0, compressionLevel, noLimit);
}

void* LZ4_createHC (const char* inputBuffer)
{
    void* hc4 = ALLOCATOR(sizeof(LZ4HC_Data_Structure));   /* heap allocation */
    LZ4_initHC ((LZ4HC_Data_Structure*)hc4, (const BYTE*)inputBuffer);
    return hc4;
}

int LZ4_compressHC2_limitedOutput(const char* source, char* dest, int inputSize, int maxOutputSize, int compressionLevel)
{
    void* ctx = LZ4_createHC(source);   /* allocates a fresh state on every call */
    int result;
    if (ctx==NULL) return 0;
    result = LZ4HC_compress_generic (ctx, source, dest, inputSize, maxOutputSize, compressionLevel, limitedOutput);
    LZ4_freeHC(ctx);   /* and frees it again */
    return result;
}

The only performance hit I can imagine is the initialization cost of the thread-local state. If the test only runs compress once, or once per thread, then the initialization cost is not amortized and could make the result look pretty bad. Or there could be other reasons that are not yet clear. I'm pretty busy these days; I will come back and dig into it when I get some spare time.

@jpountz (Collaborator) commented Sep 25, 2014

I guess a trade-off would be to expose a compression "state" so that consumers could decide to allocate on demand (as today) or to cache it, depending on their needs?
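For illustration, such an API could look roughly like this (all names hypothetical, not the actual lz4-java interface); the caching policy then becomes the caller's decision:

// Hypothetical sketch of an explicit-state API.
LZ4HCState state = compressor.newState();   // ~256 KB, reusable across calls
// The caller chooses the policy: pool it, keep it in a ThreadLocal,
// or allocate a fresh one per call (today's behaviour).
int written = compressor.compress(state, src, 0, src.length,
                                  dest, 0, dest.length);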

@linnaea (Contributor) commented Sep 26, 2014

I guess it's better not to do anything fancy with the JNI implementation. I tried using the stack for the state, and the performance gain isn't noticeable (somewhere between 0.1% and 0.3%).

@sunxiaoguang (Contributor, Author)

LZ4 is actually better than zlib for this scenario, as it only allocates memory once and the memory it allocates is smaller than zlib's, which makes the overhead lower. Anyway, the situation only gets bad when the memory allocator is under high lock contention; in single-threaded or lightly contended cases the performance hit is acceptable.

@linnaea (Contributor) commented Sep 26, 2014

My guess is that the performance bottleneck of the JNI implementation isn't the allocation of memory for the state, but the JNI interface and the JVM itself. For the Java implementations, having a cache for the state can improve performance significantly in most cases.

@sunxiaoguang (Contributor, Author)

Yeah, you have a point. I just took a quick look at the HotSpot code, and GetPrimitiveArrayCritical seems to be very heavy: it acquires the GC lock before doing the actual work. So if there were contention, the contention in the JNI part alone would already be more serious. It probably isn't worth fixing this now.

JNI_ENTRY(void*, jni_GetPrimitiveArrayCritical(JNIEnv *env, jarray array, jboolean *isCopy))
  JNIWrapper("GetPrimitiveArrayCritical");
  DTRACE_PROBE3(hotspot_jni, GetPrimitiveArrayCritical__entry, env, array, isCopy);
  GC_locker::lock_critical(thread);   // takes the GC lock before touching the array
  if (isCopy != NULL) {
    *isCopy = JNI_FALSE;
  }
  oop a = JNIHandles::resolve_non_null(array);
  assert(a->is_array(), "just checking");
  BasicType type;
  if (a->is_objArray()) {
    type = T_OBJECT;
  } else {
    type = typeArrayKlass::cast(a->klass())->element_type();
  }
  void* ret = arrayOop(a)->base(type);
  DTRACE_PROBE1(hotspot_jni, GetPrimitiveArrayCritical__return, ret);
  return ret;
JNI_END
