streaming/simd-based parsing? #51

gbaz · 2019-09-14T18:19:58Z

We have found really nice performance on the hw-json stuff once we moved to the streaming/simd stuff that's now available, even on huge files.

But it appears xml cursors can only be created via blanking, which is not streaming. As such, handling 25G or so files seems out of the question without a lot of memory lying around.

Is there some streaming parse stuff I missed, or can it be added?

newhoggy · 2019-09-16T08:57:37Z

SIMD support will be quite a lot of work, but perhaps as an interim the blanking stuff can be converted to use streaming.

newhoggy · 2019-09-23T06:39:48Z

I've released a new version hw-xml-0.4.0.0.

There is a new count command which serves as an example of how to use the library. This command has three methods: mmap, memory and naive.

naive is the old way of loading a cursor.
memory streams into a cursor in-memory.
mmap is like memory except it memory maps the XML text, so any part of the text that is not accessed is not loaded into memory.
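
To illustrate the difference between loading everything up front and streaming, here is a minimal sketch that uses only the bytestring package. It is not hw-xml's API: countOpenAngles is a hypothetical helper that just counts '<' bytes one chunk at a time, standing in for real index construction.

```haskell
import qualified Data.ByteString       as BS
import qualified Data.ByteString.Char8 as BC
import qualified Data.ByteString.Lazy  as LBS
import           Data.List             (foldl')

-- Hypothetical helper (not part of hw-xml): count '<' bytes (0x3C)
-- one strict chunk at a time, so only the current chunk needs to be
-- resident while the fold runs.
countOpenAngles :: LBS.ByteString -> Int
countOpenAngles = foldl' step 0 . LBS.toChunks
  where
    step acc chunk = acc + BS.count 0x3C chunk

main :: IO ()
main = do
  -- An in-memory stand-in for a file that arrives in two chunks.
  let doc = LBS.fromChunks [BC.pack "<a><b>x</b>", BC.pack "<c/></a>"]
  print (countOpenAngles doc)
```

For a real file, LBS.readFile yields the chunks lazily, so the same fold can run over a large input without holding all of it in memory, whereas BS.readFile reads the whole file into one strict buffer first.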

Please let me know if the memory or mmap methods improve the situation for you.

gbaz · 2019-09-23T19:02:06Z

Thanks! I'll take a look. I sort of sketched the same idea I think in a branch I was playing with. There's no reason for you to look at it for that. However I also found out through some light benchmarking that using a direct jump rather than a lookup for the interests stuff shaved some time, so you might want to borrow that: master...gbaz:gb/streaming-maybe#diff-971ecc88a60c0833627e89c27934eeacR28

newhoggy · 2019-09-27T00:17:13Z

I’m adding various create- commands to hw-xml so we can measure the performance of various parts of the parser.

For example create-index, create-ib-index, create-bp-index, create-blanked-xml.

I trust these measurements more than profiling because instrumentation can affect performance.

This was the approach I took with hw-dsv. hw-xml predates hw-dsv, so it lacked this support.

newhoggy · 2019-09-27T02:19:05Z

I checked the performance improvement of your branch and it looks worthwhile:

$ time hw-xml create-index --input psd7003.xml --ib-output psd7003.xml.ib.idx --bp-output psd7003.xml.bp.idx --method stream
hw-xml create-index --input psd7003.xml --ib-output psd7003.xml.ib.idx      32.75s user 3.87s system 214% cpu 17.036 total
$ time hw-xml create-index --input psd7003.xml --ib-output psd7003.xml.ib.idx --bp-output psd7003.xml.bp.idx --method stream
hw-xml create-index --input psd7003.xml --ib-output psd7003.xml.ib.idx      32.03s user 3.60s system 210% cpu 16.898 total
$ time hw-xml create-index --input psd7003.xml --ib-output psd7003.xml.ib.idx --bp-output psd7003.xml.bp.idx --method stream
hw-xml create-index --input psd7003.xml --ib-output psd7003.xml.ib.idx      31.17s user 3.86s system 206% cpu 16.996 total
$ time hw-xml-gb-1 create-index --input psd7003.xml --ib-output psd7003.xml.ib.idx --bp-output psd7003.xml.bp.idx --method stream
hw-xml-gb-1 create-index --input psd7003.xml --ib-output psd7003.xml.ib.idx    32.28s user 3.83s system 218% cpu 16.543 total
$ time hw-xml-gb-1 create-index --input psd7003.xml --ib-output psd7003.xml.ib.idx --bp-output psd7003.xml.bp.idx --method stream
hw-xml-gb-1 create-index --input psd7003.xml --ib-output psd7003.xml.ib.idx    29.86s user 3.90s system 206% cpu 16.384 total
$ time hw-xml-gb-1 create-index --input psd7003.xml --ib-output psd7003.xml.ib.idx --bp-output psd7003.xml.bp.idx --method stream
hw-xml-gb-1 create-index --input psd7003.xml --ib-output psd7003.xml.ib.idx    30.92s user 3.83s system 213% cpu 16.289 total

newhoggy · 2019-09-27T02:37:00Z

I think table lookup is the right approach, but that UArray has terrible performance.

newhoggy · 2019-09-27T02:56:09Z

Merged two optimisations:

https://github.com/haskell-works/hw-xml/pull/78
https://github.com/haskell-works/hw-xml/pull/79

newhoggy · 2019-09-27T03:07:34Z

New version released: hw-xml-0.4.0.1

gbaz · 2019-09-27T22:38:31Z

I'd think that the case statement that matches on a primitive would be equally good, since that should just get directly code-generated into a jump table by ghc. I don't imagine that's very different than a good lookup table anyway :-)
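
For readers following along, the two approaches being compared can be sketched side by side. This is an illustration only: the set of "interesting" bytes below is an assumption, not hw-xml's actual definition, and the function names are hypothetical.

```haskell
import           Data.Array.Unboxed (UArray, listArray, (!))
import           Data.Word          (Word8)

-- Approach 1: a case expression on the byte value.
-- The chosen bytes are an assumed example set, not hw-xml's real one.
isInterestingCase :: Word8 -> Bool
isInterestingCase w = case w of
  0x3C -> True   -- '<'
  0x3E -> True   -- '>'
  0x22 -> True   -- '"'
  0x27 -> True   -- '\''
  0x20 -> True   -- ' '
  _    -> False

-- Approach 2: a 256-entry lookup table built once from the same predicate.
interestTable :: UArray Word8 Bool
interestTable = listArray (0, 255) (map isInterestingCase [0 .. 255])

isInterestingTable :: Word8 -> Bool
isInterestingTable w = interestTable ! w

main :: IO ()
main =
  -- Check that both formulations agree on every possible byte.
  print (all (\w -> isInterestingCase w == isInterestingTable w)
             [minBound .. maxBound])
```

Whether GHC turns the case expression into a jump table depends on the compiler version and optimisation settings, so the two are best compared by measurement. Note also that (!) is bounds-checked, which may account for part of the UArray cost; unsafeAt from Data.Array.Base skips that check.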
