Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8338737: Shenandoah: Reset marking bitmaps after the cycle #516

Open
wants to merge 38 commits into
base: master
Choose a base branch
from

Conversation

pengxiaolong
Copy link

@pengxiaolong pengxiaolong commented Oct 17, 2024

All the shenandoah passed, still waiting for our test farm to verify performance.

Reset marking bitmaps after collection cycle, for GenShen only do this for young generation, also choose not do this for Degen and full GC since both are running at safepoint, we should leave safepoint as ASAP.

I have run same workload for 30s with shenandoah in generational mode and classic mode, average average time of concurrent reset dropped significantly(283us to 109us for GenShen, 276us to 167us for classic shenandoah.

Classic shenandoah baseline:

[30.973s][info][gc,stats    ] Concurrent Reset               =    0.148 s (a =      276 us) (n =   536) (lvls, us =      141,      205,      240,      291,      694)
[30.973s][info][gc,stats    ] Pause Init Mark (G)            =    0.136 s (a =      253 us) (n =   536) (lvls, us =      119,      242,      254,      264,      926)
[30.973s][info][gc,stats    ] Pause Init Mark (N)            =    0.047 s (a =       87 us) (n =   536) (lvls, us =       62,       84,       88,       91,      110)
...

Classic shenandoah reset bitmaps after cycle:

[30.967s][info][gc,stats    ] Concurrent Reset               =    0.109 s (a =      167 us) (n =   652) (lvls, us =       48,      113,      139,      160,      888)
[30.967s][info][gc,stats    ] Concurrent Reset After Collect =    0.085 s (a =      132 us) (n =   648) (lvls, us =      107,      121,      125,      131,      532)
[30.967s][info][gc,stats    ] Pause Init Mark (G)            =    0.189 s (a =      289 us) (n =   652) (lvls, us =      117,      260,      277,      291,     2297)
[30.967s][info][gc,stats    ] Pause Init Mark (N)            =    0.058 s (a =       89 us) (n =   652) (lvls, us =       62,       85,       89,       94,      143)
...

GenShen baseline

[31.008s][info][gc,stats    ] Concurrent Reset               =    0.107 s (a =      283 us) (n =   379) (lvls, us =      143,      225,      283,      330,      753)
[31.008s][info][gc,stats    ] Pause Init Mark (G)            =    0.098 s (a =      259 us) (n =   379) (lvls, us =      104,      227,      262,      305,     1033)
[31.008s][info][gc,stats    ] Pause Init Mark (N)            =    0.034 s (a =       90 us) (n =   379) (lvls, us =       67,       81,       89,       99,      130)
...

GenShen reset bitmaps after cycle

[30.977s][info][gc,stats    ] Concurrent Reset               =    0.050 s (a =      109 us) (n =   462) (lvls, us =       54,       77,      100,      125,      496)
[30.977s][info][gc,stats    ] Concurrent Reset After Collect =    0.130 s (a =      281 us) (n =   462) (lvls, us =      104,      193,      229,      326,     1120)
[30.978s][info][gc,stats    ] Pause Init Mark (G)            =    0.111 s (a =      241 us) (n =   462) (lvls, us =      105,      223,      240,      273,      631)
[30.978s][info][gc,stats    ] Pause Init Mark (N)            =    0.040 s (a =       87 us) (n =   462) (lvls, us =       64,       75,       87,       98,      122)
...

Throughput is also improved by ~21% if use the GC counts as proxy metrics.


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed (1 review required, with at least 1 Committer)

Issue

  • JDK-8338737: Shenandoah: Reset marking bitmaps after the cycle (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/shenandoah.git pull/516/head:pull/516
$ git checkout pull/516

Update a local copy of the PR:
$ git checkout pull/516
$ git pull https://git.openjdk.org/shenandoah.git pull/516/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 516

View PR using the GUI difftool:
$ git pr show -t 516

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/shenandoah/pull/516.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented Oct 17, 2024

👋 Welcome back xpeng! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Oct 17, 2024

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@pengxiaolong pengxiaolong marked this pull request as ready for review October 17, 2024 20:47
@openjdk openjdk bot added the rfr Pull request is ready for review label Oct 17, 2024
@mlbridge
Copy link

mlbridge bot commented Oct 17, 2024

Copy link
Contributor

@earthling-amzn earthling-amzn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some concerns about the correctness here. Consider this scenario:

  1. At the end of a young or mixed collection, op_reset_after_collect will reset young regions and unset the needs reset flag for the young generation.
  2. An old region (A) from the mixed collection could be repurposed as a young region (it's bitmap will not be reset when it is recycled).
  3. The next young/mixed collection will not reset the bitmap for region A because the needs reset flag is clear.
  4. The collector will see stale mark bits for region A and will probably crash.

@pengxiaolong
Copy link
Author

I have some concerns about the correctness here. Consider this scenario:

  1. At the end of a young or mixed collection, op_reset_after_collect will reset young regions and unset the needs reset flag for the young generation.
  2. An old region (A) from the mixed collection could be repurposed as a young region (it's bitmap will not be reset when it is recycled).
  3. The next young/mixed collection will not reset the bitmap for region A because the needs reset flag is clear.
  4. The collector will see stale mark bits for region A and will probably crash.

My understanding is: op_reset_after_collect is called at the end of a successful collection after the old region (A) is repurposed as a young region, if so it is not an issue, I'll double confirm.

@pengxiaolong
Copy link
Author

I have some concerns about the correctness here. Consider this scenario:

  1. At the end of a young or mixed collection, op_reset_after_collect will reset young regions and unset the needs reset flag for the young generation.
  2. An old region (A) from the mixed collection could be repurposed as a young region (it's bitmap will not be reset when it is recycled).
  3. The next young/mixed collection will not reset the bitmap for region A because the needs reset flag is clear.
  4. The collector will see stale mark bits for region A and will probably crash.

My understanding is: op_reset_after_collect is called at the end of a successful collection after the old region (A) is repurposed as a young region, if so it is not an issue, I'll double confirm.

op_reset_after_collect also reset the regions are not affiliated, after a mixed/young collection, the old region should have been recycled before resetting bitmaps, so even the recycled old region is repurposed as young region later, it should be ok.

If the young/mixed_ collection is set to bootstrap old GC, the reset of bitmaps is delayed to old GC to avoid dirty bitmaps

@pengxiaolong
Copy link
Author

pengxiaolong commented Oct 18, 2024

I have some concerns about the correctness here. Consider this scenario:

  1. At the end of a young or mixed collection, op_reset_after_collect will reset young regions and unset the needs reset flag for the young generation.
  2. An old region (A) from the mixed collection could be repurposed as a young region (it's bitmap will not be reset when it is recycled).
  3. The next young/mixed collection will not reset the bitmap for region A because the needs reset flag is clear.
  4. The collector will see stale mark bits for region A and will probably crash.

My understanding is: op_reset_after_collect is called at the end of a successful collection after the old region (A) is repurposed as a young region, if so it is not an issue, I'll double confirm.

op_reset_after_collect also reset the regions are not affiliated, after a mixed/young collection, the old region should have been recycled before resetting bitmaps, so even the recycled old region is repurposed as young region later, it should be ok.

If the young/mixed_ collection is set to bootstrap old GC, the reset of bitmaps is delayed to old GC to avoid dirty bitmaps

By repeatedly running test gc/stress/gcold/TestGCOldWithShenandoah.java#generational many times, I could produce a crash, not sure root cause yet, but this might be the root cause.

@pengxiaolong
Copy link
Author

By repeatedly running test gc/stress/gcold/TestGCOldWithShenandoah.java#generational many times, I could produce a crash, not sure root cause yet, but this might be the root cause.

It is not a crash, looks like it is not related, I'll run test w/o the change to verify:

Exception in thread "main" java.lang.NullPointerException: Inflater has been closed
        at java.base/java.util.zip.Inflater.ensureOpen(Inflater.java:705)
        at java.base/java.util.zip.Inflater.reset(Inflater.java:676)
        at java.base/java.util.zip.ZipFile$CleanableResource.releaseInflater(ZipFile.java:764)
        at java.base/java.util.zip.ZipFile$InflaterCleanupAction.run(ZipFile.java:440)
        at java.base/jdk.internal.ref.CleanerImpl$PhantomCleanableRef.performCleanup(CleanerImpl.java:178)
        at java.base/jdk.internal.ref.PhantomCleanable.clean(PhantomCleanable.java:133)
        at java.base/java.util.zip.ZipFile$ZipFileInflaterInputStream.close(ZipFile.java:470)
        at java.base/jdk.internal.loader.Resource.getBytes(Resource.java:121)
        at java.base/jdk.internal.loader.URLClassPath$JarLoader$2.getBytes(URLClassPath.java:872)
        at java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:859)
        at java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760)
        at java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681)
        at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639)
        at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
        at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:528)
        at com.sun.javatest.regtest.agent.MainWrapper.main(MainWrapper.java:94)

@pengxiaolong
Copy link
Author

It is not a crash, looks like it is not related, I'll run test w/o the change to verify:

I have run TestGCOldWithShenandoah.java#generational on GenShen tip, looks likes GenShen has bug causing stale mark bits, although it is rare, the test failed after 73 repeats with crash:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/home/xlpeng/repos/jdk-xlpeng/src/hotspot/share/gc/shenandoah/shenandoahConcurrentGC.cpp:622), pid=26873, tid=26959
#  assert(_generation->is_bitmap_clear()) failed: need clear marking bitmap
#
# JRE version: OpenJDK Runtime Environment (24.0) (fastdebug build 24-internal-adhoc.xlpeng.jdk-xlpeng)
# Java VM: OpenJDK 64-Bit Server VM (fastdebug 24-internal-adhoc.xlpeng.jdk-xlpeng, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, shenandoah gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0x172ac44]  ShenandoahConcurrentGC::op_init_mark()+0x384
#
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /local/home/xlpeng/repos/jdk-xlpeng/build/linux-x86_64-server-fastdebug/test-support/jtreg_test_hotspot_jtreg_gc_stress_gcold_TestGCOldWithShenandoah_java_generational/scratch/0/hs_err_pid26873.log
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
#

@pengxiaolong
Copy link
Author

My change also has stale mark bits issue I believe, gitfarm Genshen test fails sometimes due to crash caused by stale bitmap marks.

Amended one fix in this PR, it should solve the issue: d4fddac#diff-7b87cde2e13d8769dc21cf6f7ae45c92942776fcb44275339182b7fa5c290ce0

I have run TestGCOldWithShenandoah.java#generational 100 times w/o crash caused by stale bitmap marks last night.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
rfr Pull request is ready for review
Development

Successfully merging this pull request may close these issues.

2 participants