single-thread mode #3239

Open
notestaff opened this issue Feb 23, 2024 · 4 comments
Labels: feature/proposal, question

Comments

@notestaff
Is it possible for seqan3-based programs to use only one CPU? I tried setting seqan3::contrib::bgzf_thread_count to 1, but the BAM-reading program still uses 200% CPU according to GNU time: one main thread and one for seqan3's decompression. Looking at the code, setting seqan3::contrib::bgzf_thread_count to 0 would not be supported, correct?

I'm trying to make a CLI like that of samtools: using one CPU by default, with an option to specify additional CPUs. Is there a way to do that? Thanks!

notestaff added the question label on Feb 23, 2024

@eseiler
Member

eseiler commented Feb 23, 2024

The bgzf handling is built around using a threadpool, so it always spawns at least one thread.

It should be possible to use the sam_file_input constructor that takes a stream for the input.

If there should only be one thread, the gz_stream could be used, which should work for bgzf compressed files.

So:

  • check if the file is BAM/bgzf compressed (easy, but not reliable: file extension. Harder, but reliable: magic number)
  • if not: just use sam_file_input with the filename
  • otherwise: construct an fstream from the file
  • construct a gz stream from the fstream
  • use the sam_file_input ctor with the gz stream and format_bam as format

Haven't tried it yet, but this would be my hacky workaround.
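
The "magic number" check from the first step can be sketched with the standard library alone. This is a minimal, untested-against-seqan3 sketch: `compression_kind` and `detect_compression` are hypothetical names, not seqan3 API. It assumes a seekable stream and the common BGZF layout in which the `BC` extra subfield comes first in the gzip header (XLEN == 6); a header with additional subfields would be reported as plain gz.

```cpp
#include <array>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Kind of compression guessed from the first bytes of a stream (hypothetical helper type).
enum class compression_kind { none, gz, bgzf };

// Reads up to 16 header bytes, then rewinds the stream to where it started.
// Requires a seekable stream, e.g. a std::ifstream opened in binary mode.
inline compression_kind detect_compression(std::istream & in)
{
    std::array<unsigned char, 16> h{};
    std::istream::pos_type const start = in.tellg();
    in.read(reinterpret_cast<char *>(h.data()), static_cast<std::streamsize>(h.size()));
    std::size_t const n = static_cast<std::size_t>(in.gcount());
    in.clear();      // a short read sets eofbit/failbit
    in.seekg(start); // rewind so the caller can read from the beginning

    // gzip member header: ID1 = 0x1f, ID2 = 0x8b, CM = 8 (deflate)
    if (n < 3 || h[0] != 0x1f || h[1] != 0x8b || h[2] != 0x08)
        return compression_kind::none;

    // BGZF: FEXTRA flag set, XLEN == 6, subfield id 'B','C' with SLEN == 2
    bool const is_bgzf = n == 16 && (h[3] & 0x04) != 0
                      && h[10] == 6 && h[11] == 0
                      && h[12] == 'B' && h[13] == 'C'
                      && h[14] == 2 && h[15] == 0;
    return is_bgzf ? compression_kind::bgzf : compression_kind::gz;
}
```

If the result is `bgzf` (or `gz`), the remaining steps would wrap the fstream in the gz stream and hand that to sam_file_input together with format_bam; for `none`, the filename constructor can be used directly.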

As for our code:
It should be possible to just use gz (for input) if there is one thread requested. Not sure about performance implications for decompressing (probably none?). We can't really do it for output, because we would then write a gz file instead of bgzf.
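
For a samtools-like CLI, the "use gz if one thread is requested" idea could be expressed as a small mapping from the requested thread count to a decompression mode. This is only a sketch of the decision logic: `input_decompressor`, `decompression_plan` and `plan_for` are hypothetical names, and the `--threads` option is assumed, not something seqan3 provides.

```cpp
#include <cstddef>

// How the input should be decompressed (hypothetical names, not seqan3 API).
enum class input_decompressor { serial_gz, threaded_bgzf };

struct decompression_plan
{
    input_decompressor mode;
    std::size_t worker_threads; // threads in addition to the main thread
};

// total_threads: value of an assumed --threads option, defaulting to 1.
inline decompression_plan plan_for(std::size_t total_threads)
{
    // One thread in total: decompress in the main thread via plain gz.
    if (total_threads <= 1)
        return {input_decompressor::serial_gz, 0};

    // Otherwise keep one thread for the main loop and give the rest to bgzf;
    // worker_threads is the value one would assign to
    // seqan3::contrib::bgzf_thread_count before opening the file.
    return {input_decompressor::threaded_bgzf, total_threads - 1};
}
```

With this mapping, `--threads 1` never touches the bgzf thread pool, and `--threads 4` yields three decompression workers next to the main thread.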

@rrahn
Contributor

rrahn commented May 8, 2024

This seems to me like a recurring issue, and I am wondering if the mechanism to switch to gz-decompression instead of bgzf-decompression should be more straightforward to handle in the API.

@eseiler
Member

eseiler commented May 8, 2024

This seems to me like a recurring issue, and I am wondering if the mechanism to switch to gz-decompression instead of bgzf-decompression should be more straightforward to handle in the API.

I agree.

Another thing we had is that we used to write bgzf files when gz output was requested.
bgzf is faster because it can be parallelised. However, bgzf is not the same as gz, though it's compatible.
The binary representation is different and the file size differs (I think I had a case where a bgzf compressed FASTA file was 20% bigger than the gz compressed counterpart).

@rrahn
Contributor

rrahn commented May 8, 2024

True. Following this, I could make out the following four possible decisions that could be made by the user:

On output

  1. Use bgzf for output compression
    • default by spec
    • random access support
    • serial (no separate compression thread) or parallel (at least two threads: 1 main, >= 1 compression worker)
    • Which mode is default? If parallel, how many threads are default?
  2. Allow user to explicitly switch to gz-compression
    • no random access support
    • always single-threaded

On input

  1. Use bgzf-decompression if bgzf-compressed
    • default by spec
    • always parallel
    • serial (no separate decompression thread) or parallel (at least two threads: 1 main, >= 1 decompression worker)
  2. Allow user to explicitly use gz-decompression
    • always serial
    • independent of bgzf or gz-compression

rrahn added the feature/proposal label on May 8, 2024