Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

results.cov.bed.gz last line #84

Open
gevro opened this issue Jan 17, 2024 · 25 comments
Open

results.cov.bed.gz last line #84

gevro opened this issue Jan 17, 2024 · 25 comments

Comments

@gevro
Copy link

gevro commented Jan 17, 2024

Hi,
The nanoseq pipeline vs v3.2.1 has a minor bug in that it outputs this as the last line in results.cov.bed.gz:

	0	1	;;0 

I don't know yet if this happens in v3.5.4, but letting you know in case you see this bug and it persisted in v3.5.4.

@gevro gevro changed the title results.cov.bed.gz results.cov.bed.gz last line Jan 17, 2024
@fa8sanger
Copy link
Collaborator

fa8sanger commented Jan 18, 2024 via email

@gevro gevro closed this as completed Jan 18, 2024
@gevro
Copy link
Author

gevro commented Jan 25, 2024

Hi,
This bug has actually not been fixed yet in v3.5.4. Last line of one results.cov.bed.gz:

	740435584	740435585	;;0

I noticed this again because it caused this error in the 'post' step tabix, so it is causing the pipeline to crash:

Executing: tabix -f sampleID/tmpNanoSeq/post/results.cov.bed.gz

Traceback (most recent call last):
  File "/opt/wtsi-cgp/bin/runNanoSeq.py", line 1146, in <module>
    runCommand(cmd)
  File "/opt/wtsi-cgp/bin/runNanoSeq.py", line 430, in runCommand
    raise ValueError(error)
ValueError: [E::hts_idx_check_range] Region 740435584..740435585 cannot be stored in a tbi index. Try using a csi index
tbx_index_build failed: sampleID/tmpNanoSeq/post/results.cov.bed.gz

It looks like partition 32 of the var output is the issue:

$ bgzip -cd 32.cov.bed.gz
	740435584	740435585	;;0

Is there a reason why partition 32 of var is empty? I see there is a check in the 'post' code to see if files are 0 bytes:

if ( os.stat(ifile).st_size == 0 ) : continue

But this doesn't catch these files that have one malformed line.

I can try a workaround by adding an awk script to the post that removes any rows with blank first column, but probably there is some upstream issue causing this to happen. I think this workaround (adding awk filter for NF==4) will work until the upstream bug is fixed:

cmd += "bgzip -dc %s | awk \'NF==4\' >> %s ;" % (ifile, outFile)

@gevro gevro reopened this Jan 25, 2024
@fa8sanger
Copy link
Collaborator

fa8sanger commented Jan 25, 2024 via email

@gevro
Copy link
Author

gevro commented Jan 25, 2024

  1. When we reran the pipeline with v3.5.4, we started from the beginning: cov, part, dsa, var, etc. So the issue is still present in v3.5.4

  2. I just checked the log for the 2 samples that had this issue. I don't see any error messages for chunk32 for var or dsa steps. In both of those samples, interestingly chunk 32 gets executed first for both dsa and var, not sure why. Maybe that has something to do with it. And both of them span chrY regions. I think most likely the issue is there are no mutations or candidate mutations in that chunk and somehow that causes some empty data variable/data structure in the upstream code that leads to this issue. If I had to guess, it probably means the coverage calculations and trinucleotide background info relating to that chunk may be wrong.

Note, this happened for 2 out of ~30 samples.

@gevro
Copy link
Author

gevro commented Feb 5, 2024

Just FYI: v3.5.5 still has this bug.

@fa8sanger
Copy link
Collaborator

fa8sanger commented Feb 5, 2024 via email

@gevro
Copy link
Author

gevro commented Feb 5, 2024

Sorry I deleted the temp folder already. But last time when I checked, there are no dsa or var errors for the problematic chunk.

As a workaround, I changed the post step script to only keep lines with 4 fields using awk when the bed coverage files are merged.

@fa8sanger
Copy link
Collaborator

fa8sanger commented Feb 6, 2024 via email

@gevro
Copy link
Author

gevro commented Feb 6, 2024

Will try that. However, if that is the fix, will we need to rerun the pipeline on all the samples? i.e. does this fix change the final coverage BED or the variant calls relative to my workaround of adding the below awk filter in the post step?
cmd += "bgzip -dc %s | awk 'NF==4' >> %s ;" % (ifile, outFile)

@gevro
Copy link
Author

gevro commented Feb 6, 2024

Also, do you have brief instructions on compilation after cloning the repository into the docker? I tried playing around with the Makefile and scripts in the build directory, but running into issues. Or if there is a simple way to only compile variantcaller.cc.

@fa8sanger
Copy link
Collaborator

fa8sanger commented Feb 6, 2024 via email

@gevro
Copy link
Author

gevro commented Feb 7, 2024

I've compiled and testing the bug fix (I changed the two lines in variantcaller.cc as you suggested), but I want to make sure it is an accurate fix.

I can see now that the 32.cov.bed.gz is empty. However, the 32.var has:

  • 3 lines that begin with 'Mismatches'
  • A bunch of lines that begin with CallVsQpos
  • A bunch of lines that begin with PyrVsMask
  • A bunch of lines that beign with ReadBundles
  • Final two lines:
    Burdens 1 0 223973
    Coverage 11060

Does the final line indicate that there should be coverage in 32.cov.bed.gz? i.e. Coverage = 11060. So why is 32.cov.bed.gz now empty? I don't know what these lines encode exactly.

Thanks

@gevro
Copy link
Author

gevro commented Feb 7, 2024

Also one more potential issue. I saved md5sums of all the var outputs before the bug fix, and I'm comparing to the md5sums of those outputs after the bug fix. The md5sums of all the cov.bed.gz files and all the .var files are different after the bug fix. But I would have expected a change only for the problematic last chunk #32 files, no?

@fa8sanger
Copy link
Collaborator

fa8sanger commented Feb 8, 2024 via email

@fa8sanger
Copy link
Collaborator

fa8sanger commented Feb 8, 2024 via email

@gevro
Copy link
Author

gevro commented Feb 9, 2024

Sure, I will email it offline.

@gevro
Copy link
Author

gevro commented Feb 11, 2024

Hi,
I tested variantcaller with the two changes you suggested, this time making the changes on the v3.5.5 version of variantcaller.cc.

This confirmed that all var output files are identical except for 32.cov.bed.gz.

32.cov.bed.gz in the old version:
$ zcat 32.cov.bed.gz
0 1 ;?;0

32.cov.bed.gz in the new version is empty. But note the file is still 20 bytes in size, not 0 bytes, because after bgzip, there is a minimum file size even for empty files.
$ ls -lh 32.cov.bed.gz
-rw-rw-r--. 20 Feb 11 10:51 32.cov.bed.gz

So I think this change is safe to make. But note that I tested it with v3.5.5 variantcaller.cc, not with the ‘develop’ version. So I can’t vouch for how these changes will behave in the ‘develop’ branch version of variantcaller.cc.

Thanks

@fa8sanger
Copy link
Collaborator

fa8sanger commented Feb 11, 2024 via email

@gevro
Copy link
Author

gevro commented Feb 11, 2024

Yes all good as far as I can tell.

@gevro
Copy link
Author

gevro commented Feb 11, 2024

Also, is it possible to make just this change in the next version, without all the other pending 'develop' branch changes? Since I'm not sure / haven't tested how this fix interacts with all the other pending develop branch changes.

@fa8sanger
Copy link
Collaborator

Alex, would you be able to add these changes please? I had forgotten about this

@gevro
Copy link
Author

gevro commented Jul 7, 2024

Hi, I'm curious if this bug will be fixed in the next version?

@fa8sanger
Copy link
Collaborator

fa8sanger commented Jul 8, 2024 via email

@gevro
Copy link
Author

gevro commented Jul 8, 2024

Yes, I tested it, but I didn't see it in the latest release 3.5.5. Is it updated in a dev branch that will be part of the next release?

@fa8sanger
Copy link
Collaborator

fa8sanger commented Jul 8, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants