streaming/simd-based parsing? #51

Open
gbaz opened this issue Sep 14, 2019 · 9 comments

Comments

@gbaz

gbaz commented Sep 14, 2019

We have found really nice performance on the hw-json stuff once we moved to the streaming/simd stuff that's now available, even on huge files.

But it appears xml cursors can only be created via blanking, which is not streaming. As such, handling 25G or so files seems out of the question without a lot of memory lying around.

Is there some streaming parse stuff I missed, or can it be added?

@newhoggy
Member

newhoggy commented Sep 16, 2019

SIMD support will be quite a lot of work, but perhaps as an interim the blanking stuff can be converted to use streaming.

@newhoggy
Member

I've released a new version hw-xml-0.4.0.0.

There is a new count command which serves as an example of how to use the library. This command has three methods: mmap, memory and naive.

naive is the old way of loading a cursor.
memory streams into a cursor in-memory.
mmap is like memory except it memory maps the XML text, so any part of the text that is not accessed is not loaded into memory.

Please let me know if the memory or mmap methods improve the situation for you.
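
To illustrate the difference between loading a whole file and streaming it, here is a minimal sketch that uses only the bytestring package. It is not the hw-xml API: `countOpenAngles` is a hypothetical helper, and counting `<` bytes merely stands in for real cursor construction. The point is that a lazy ByteString is consumed one chunk at a time, so chunks that have already been processed can be garbage collected.

```haskell
{-# LANGUAGE BangPatterns #-}
module Main where

import qualified Data.ByteString      as BS
import qualified Data.ByteString.Lazy as BL
import           Data.List            (foldl')
import           System.Environment   (getArgs)

-- | Count '<' bytes (0x3C) one strict chunk at a time.
-- Hypothetical stand-in for per-chunk index construction.
countOpenAngles :: BL.ByteString -> Int
countOpenAngles = foldl' step 0 . BL.toChunks
  where
    -- The bang keeps the accumulator evaluated, so no thunks build up.
    step !acc chunk = acc + BS.count 0x3C chunk

main :: IO ()
main = do
  args <- getArgs
  case args of
    -- BL.readFile reads lazily, so the file is never fully resident.
    [path] -> BL.readFile path >>= print . countOpenAngles
    _      -> putStrLn "usage: count-angles FILE"
```

Because only the running count is retained, memory use in this sketch should stay roughly constant regardless of file size. A real index still has to be stored somewhere, so the saving there applies to the input text rather than to the index itself.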

@gbaz
Author

gbaz commented Sep 23, 2019

Thanks! I'll take a look. I sort of sketched the same idea I think in a branch I was playing with. There's no reason for you to look at it for that. However I also found out through some light benchmarking that using a direct jump rather than a lookup for the interests stuff shaved some time, so you might want to borrow that: master...gbaz:gb/streaming-maybe#diff-971ecc88a60c0833627e89c27934eeacR28

@newhoggy
Member

newhoggy commented Sep 27, 2019

I’m adding various create- commands to hw-xml so we can measure the performance of various parts of the parser.

For example create-index, create-ib-index, create-bp-index, create-blanked-xml.

I trust these measurements more than profiling because instrumentation can affect performance.

This was the approach I took with hw-dsv. hw-xml predates hw-dsv, so it lacked this support.

@newhoggy
Member

I checked the performance improvement of your branch and it looks worthwhile:

$ time hw-xml create-index --input psd7003.xml --ib-output psd7003.xml.ib.idx --bp-output psd7003.xml.bp.idx --method stream
hw-xml create-index --input psd7003.xml --ib-output psd7003.xml.ib.idx      32.75s user 3.87s system 214% cpu 17.036 total
$ time hw-xml create-index --input psd7003.xml --ib-output psd7003.xml.ib.idx --bp-output psd7003.xml.bp.idx --method stream
hw-xml create-index --input psd7003.xml --ib-output psd7003.xml.ib.idx      32.03s user 3.60s system 210% cpu 16.898 total
$ time hw-xml create-index --input psd7003.xml --ib-output psd7003.xml.ib.idx --bp-output psd7003.xml.bp.idx --method stream
hw-xml create-index --input psd7003.xml --ib-output psd7003.xml.ib.idx      31.17s user 3.86s system 206% cpu 16.996 total
$ time hw-xml-gb-1 create-index --input psd7003.xml --ib-output psd7003.xml.ib.idx --bp-output psd7003.xml.bp.idx --method stream
hw-xml-gb-1 create-index --input psd7003.xml --ib-output psd7003.xml.ib.idx    32.28s user 3.83s system 218% cpu 16.543 total
$ time hw-xml-gb-1 create-index --input psd7003.xml --ib-output psd7003.xml.ib.idx --bp-output psd7003.xml.bp.idx --method stream
hw-xml-gb-1 create-index --input psd7003.xml --ib-output psd7003.xml.ib.idx    29.86s user 3.90s system 206% cpu 16.384 total
$ time hw-xml-gb-1 create-index --input psd7003.xml --ib-output psd7003.xml.ib.idx --bp-output psd7003.xml.bp.idx --method stream
hw-xml-gb-1 create-index --input psd7003.xml --ib-output psd7003.xml.ib.idx    30.92s user 3.83s system 213% cpu 16.289 total
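
For a rough sense of scale, the wall-clock totals above can be averaged per binary. The sketch below only does arithmetic on the six "total" figures quoted in the runs; the names are illustrative. It gives means of about 16.98s for hw-xml and 16.41s for hw-xml-gb-1, a reduction of roughly 3.4%. With three runs each, this is indicative rather than conclusive.

```haskell
module Main where

-- | Arithmetic mean of a non-empty list.
mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)

-- Wall-clock "total" seconds copied from the runs above.
baselineTotals, branchTotals :: [Double]
baselineTotals = [17.036, 16.898, 16.996]  -- hw-xml
branchTotals   = [16.543, 16.384, 16.289]  -- hw-xml-gb-1

-- | Fraction of wall-clock time saved relative to the baseline mean.
relativeSaving :: Double
relativeSaving =
  (mean baselineTotals - mean branchTotals) / mean baselineTotals

main :: IO ()
main = do
  putStrLn ("baseline mean (s): " ++ show (mean baselineTotals))
  putStrLn ("branch mean (s):   " ++ show (mean branchTotals))
  putStrLn ("relative saving:   " ++ show relativeSaving)
```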

@newhoggy
Member

I think table lookup is the right approach, but that UArray has terrible performance.

@newhoggy
Member

Merged two optimisations:

https://github.com/haskell-works/hw-xml/pull/78
https://github.com/haskell-works/hw-xml/pull/79

@newhoggy
Member

New version released: hw-xml-0.4.0.1

@gbaz
Author

gbaz commented Sep 27, 2019

I'd think that the case statement that matches on a primitive would be equally good, since that should just get directly code-generated into a jump table by ghc. I don't imagine that's very different than a good lookup table anyway :-)
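
The two approaches being compared can be sketched side by side. This is an illustration only: the set of "interesting" bytes below is an assumption, not the set hw-xml actually uses, and the function names are hypothetical. Whether GHC turns the `case` into a jump table or a sequence of comparisons may depend on the compiler version, the backend and how dense the alternatives are, so benchmarking both remains the only reliable guide.

```haskell
module Main where

import Data.Array.Unboxed (UArray, listArray, (!))
import Data.Word          (Word8)

-- | Classify a byte by matching directly on its primitive value.
-- The chosen bytes are an assumed example set.
isInterestingCase :: Word8 -> Bool
isInterestingCase w = case w of
  0x3C -> True   -- '<'
  0x3E -> True   -- '>'
  0x22 -> True   -- double quote
  0x27 -> True   -- single quote
  _    -> False

-- | The same classification precomputed into a 256-entry unboxed table.
interestTable :: UArray Word8 Bool
interestTable = listArray (0, 255) (map isInterestingCase [0 .. 255])

-- | Classify a byte via table lookup.
isInterestingTable :: Word8 -> Bool
isInterestingTable w = interestTable ! w

main :: IO ()
main =
  -- Check that both classifiers agree on every possible byte.
  print (all (\w -> isInterestingCase w == isInterestingTable w)
             [minBound .. maxBound])
```

Note that `(!)` performs a bounds check on each lookup, which is one plausible contributor to table-lookup overhead; that is a guess about this sketch, not a measured finding about hw-xml.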
