-
-
Notifications
You must be signed in to change notification settings - Fork 904
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[experiment] gumbo using a malloc arena #2790
base: main
Are you sure you want to change the base?
Conversation
so we can implement an arena-based realloc
https://www.gingerbill.org/article/2019/02/08/memory-allocation-strategies-002/ so I can benchmark the speedup from not having to free chunks
@stevecheckoway Do you have any strong feelings about this approach? It feels like it may be worth the additional complexity to try to finish this work up, but don't want to do it if you're horrified. 😄 |
I didn't read the commits in detail, but the approach of using an arena allocator makes sense. I'm still confused by the Are you sure that's caused by |
@stevecheckoway You may be right about the profiler being confused about the particular function, but the profiler is consistent in saying that children of memory-allocation routines take up a lot of time. Here's the stack profile: |
@stevecheckoway I ran the HTML5 parser under cachegrind and did get similar percentages rolled up under malloc-and-friends (but notably the pthread function is not mentioned, that must be a perftools bug). Here's the top exclusive time functionsI've marked the malloc-related functions with -------------------------------------------------------------------------------- Ir file:function -------------------------------------------------------------------------------- 390,423,514 ( 7.83%) gumbo-parser/utf8.c:read_char [/home/flavorjones/code/oss/nokogiri/lib/nokogiri/nokogiri.so] 366,486,405 ( 7.35%) ** glibc-2.31/malloc/malloc.c:_int_malloc [/usr/lib/x86_64-linux-gnu/libc-2.31.so] 270,510,622 ( 5.43%) ** glibc-2.31/malloc/malloc.c:_int_free [/usr/lib/x86_64-linux-gnu/libc-2.31.so] 214,266,390 ( 4.30%) gumbo-parser/tokenizer.c:gumbo_lex [/home/flavorjones/code/oss/nokogiri/lib/nokogiri/nokogiri.so] 174,424,705 ( 3.50%) gumbo-parser/string_buffer.c:gumbo_string_buffer_append_codepoint [/home/flavorjones/code/oss/nokogiri/lib/nokogiri/nokogiri.so] 146,082,668 ( 2.93%) ** glibc-2.31/malloc/malloc.c:malloc [/usr/lib/x86_64-linux-gnu/libc-2.31.so] 141,637,769 ( 2.84%) gumbo-parser/parser.c:handle_in_body [/home/flavorjones/code/oss/nokogiri/lib/nokogiri/nokogiri.so] 137,864,597 ( 2.77%) gumbo-parser/parser.c:gumbo_parse_with_options [/home/flavorjones/code/oss/nokogiri/lib/nokogiri/nokogiri.so] 125,220,415 ( 2.51%) ruby-3.1.2/parse.c:ruby_yyparse [/home/flavorjones/.rbenv/versions/3.1.2/lib/libruby.so.3.1.2] 115,254,958 ( 2.31%) gumbo-parser/utf8.c:utf8iterator_next [/home/flavorjones/code/oss/nokogiri/lib/nokogiri/nokogiri.so] 90,970,620 ( 1.83%) gumbo-parser/tokenizer.c:handle_attr_value_double_quoted_state [/home/flavorjones/code/oss/nokogiri/lib/nokogiri/nokogiri.so] 81,276,039 ( 1.63%) gumbo-parser/parser.c:reconstruct_active_formatting_elements [/home/flavorjones/code/oss/nokogiri/lib/nokogiri/nokogiri.so] 80,979,775 ( 1.62%) ** glibc-2.31/malloc/malloc.c:malloc_consolidate [/usr/lib/x86_64-linux-gnu/libc-2.31.so] 79,724,502 ( 1.60%) gumbo-parser/utf8.h:read_char 70,040,521 ( 1.41%) ** glibc-2.31/malloc/malloc.c:free [/usr/lib/x86_64-linux-gnu/libc-2.31.so] 64,521,678 ( 1.29%) gumbo-parser/tokenizer.c:finish_token.isra.0 [/home/flavorjones/code/oss/nokogiri/lib/nokogiri/nokogiri.so] 63,285,175 ( 1.27%) ruby-3.1.2/parse.y:ruby_yyparse 53,476,387 ( 1.07%) ruby-3.1.2/gc.c:objspace_xmalloc0 [/home/flavorjones/.rbenv/versions/3.1.2/lib/libruby.so.3.1.2] 53,351,571 ( 1.07%) gumbo-parser/parser.c:get_current_node.isra.0 [/home/flavorjones/code/oss/nokogiri/lib/nokogiri/nokogiri.so] 48,078,332 ( 0.96%) ruby-3.1.2/gc.c:ruby_sized_xfree [/home/flavorjones/.rbenv/versions/3.1.2/lib/libruby.so.3.1.2] 47,606,972 ( 0.96%) ** glibc-2.31/malloc/malloc.c:malloc_usable_size [/usr/lib/x86_64-linux-gnu/libc-2.31.so] 44,631,418 ( 0.90%) gumbo-parser/string_buffer.c:maybe_resize_string_buffer [/home/flavorjones/code/oss/nokogiri/lib/nokogiri/nokogiri.so] 43,778,475 ( 0.88%) gumbo-parser/parser.c:insert_text_token.isra.0 [/home/flavorjones/code/oss/nokogiri/lib/nokogiri/nokogiri.so] 42,218,715 ( 0.85%) ruby-3.1.2/st.c:find_table_entry_ind [/home/flavorjones/.rbenv/versions/3.1.2/lib/libruby.so.3.1.2] 42,189,298 ( 0.85%) ** glibc-2.31/malloc/malloc.c:unlink_chunk.isra.0 [/usr/lib/x86_64-linux-gnu/libc-2.31.so] 41,857,797 ( 0.84%) ruby-3.1.2/st.c:rb_st_lookup [/home/flavorjones/.rbenv/versions/3.1.2/lib/libruby.so.3.1.2] 41,302,563 ( 0.83%) ruby-3.1.2/gc.c:rgengc_check_relation.part.0 [/home/flavorjones/.rbenv/versions/3.1.2/lib/libruby.so.3.1.2] 40,823,473 ( 0.82%) ruby-3.1.2/gc.c:gc_mark_ptr [/home/flavorjones/.rbenv/versions/3.1.2/lib/libruby.so.3.1.2] 40,756,430 ( 0.82%) gumbo-parser/tokenizer.c:append_char_to_tag_buffer [/home/flavorjones/code/oss/nokogiri/lib/nokogiri/nokogiri.so] 37,432,874 ( 0.75%) ** glibc-2.31/malloc/malloc.c:realloc [/usr/lib/x86_64-linux-gnu/libc-2.31.so] 36,069,384 ( 0.72%) gumbo-parser/tokenizer.c:emit_char [/home/flavorjones/code/oss/nokogiri/lib/nokogiri/nokogiri.so] 32,630,299 ( 0.65%) gumbo-parser/tokenizer.c:handle_data_state [/home/flavorjones/code/oss/nokogiri/lib/nokogiri/nokogiri.so] 28,598,943 ( 0.57%) ** glibc-2.31/malloc/malloc.c:_int_realloc [/usr/lib/x86_64-linux-gnu/libc-2.31.so] 28,511,268 ( 0.57%) tmp/x86_64-linux/nokogiri/3.2.1/tmp/x86_64-pc-linux/ports/libxml2/2.10.3/libxml2-2.10.3/xmlstring.c:xmlStrdup [/home/flavorjones/code/oss/nokogiri/lib/nokogiri/nokogiri.so] 27,633,613 ( 0.55%) ruby-3.1.2/compile.c:iseq_setup [/home/flavorjones/.rbenv/versions/3.1.2/lib/libruby.so.3.1.2] 25,140,201 ( 0.50%) ** glibc-2.31/string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:__memcpy_avx_unaligned_erms [/usr/lib/x86_64-linux-gnu/libc-2.31.so] |
Great! I'm glad there's not some bizarre pthread call happening somewhere. As I mentioned, using an arena allocator makes a lot of sense here. Another possibility would be to try to directly produce an |
What problem is this PR intended to solve?
I've been staring at performance profiling for HTML5 document parsing and can't get past the whopping 26% of time we're spending in the many calls to
malloc
andfree
.Thinking about how we use libgumbo in Nokogiri:
But: we essentially throw away the gumbo data structure immediately every time. What if we used an ephemeral arena for each gumbo document (or fragment) and just reset the entire arena after use?
This PR is an attempt to see how much time we might save, so it's the simplest thing that could work to let me benchmark:
The result: a 20% overall improvement on a large (681kb) real-world doc. Juxtaposed with HTML4 timings for context:
Time spent in memory-related calls drops from 26.3% to 18% in the 681kb document benchmark.