Changing default behavior of BedAnnotate to preserving input lines order #644

Open
wants to merge 1 commit into
base: master

borisevichdi

  • BedAnnotate now preserves the initial sort order.
  • Adjusted comments in bedFile.h to better represent the actual binning model used for the last several years.

This commit alters BedAnnotate and the functions specific to it so that the output file preserves the line order of the input file. Currently, output lines are always re-sorted. This behavior cannot be configured, and the resulting order is an unusual one that is not even supported by SortBed, the tool dedicated to sorting.

Detailed description:
Previously, AnnotateBed used genomic binning. As a standalone tool, it never actually benefited from the bins. However, the mere use of binning forced AnnotateBed to re-sort input lines: first by chromosome, then from larger to smaller bins, and finally by start coordinate within a bin. This changes the output file in ways the documentation does not lead one to expect; see issue #622 (although the issue itself misidentifies why the problem occurs).
Specifically, the output creates the illusion that AnnotateBed sorts data in chrom:pos order. Sorting was never mentioned in the documentation, and so there was no way to change this behavior. The actual sorting algorithm, however, is more complicated and involves genomic binning (see below and the commit). The result is a seemingly arbitrary ordering of output data that cannot be switched off.
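
The chrom -> bin -> start ordering described above can be illustrated with a small sketch. It assumes the classic five-level UCSC binning constants (128 kb finest bins), which may differ from the constants actually used in bedFile.h, and it uses a hypothetical `Interval` struct and helper names (`ucscBin`, `binnedOrder`) that are not bedtools identifiers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical record type for illustration; not the bedtools BED struct.
struct Interval {
    std::string chrom;
    std::uint32_t start;  // 0-based, inclusive
    std::uint32_t end;    // exclusive
};

// Smallest bin that fully contains [start, end), using the classic UCSC
// 5-level scheme (assumed here; bedtools' own constants may differ).
// Larger bins get smaller ids, so sorting by id puts bigger bins first.
std::uint32_t ucscBin(std::uint32_t start, std::uint32_t end) {
    assert(end > start);
    static const std::uint32_t offsets[] = {512 + 64 + 8 + 1, 64 + 8 + 1, 8 + 1, 1, 0};
    std::uint32_t s = start >> 17;      // 128 kb finest bins
    std::uint32_t e = (end - 1) >> 17;
    for (std::uint32_t off : offsets) {
        if (s == e) return off + s;
        s >>= 3;                        // each level is 8x larger
        e >>= 3;
    }
    return 0;  // beyond the range of this scheme: fall back to the top-level bin
}

// Returns a copy ordered by chromosome, then bin id, then start coordinate.
std::vector<Interval> binnedOrder(std::vector<Interval> v) {
    std::stable_sort(v.begin(), v.end(), [](const Interval& a, const Interval& b) {
        const std::uint32_t ba = ucscBin(a.start, a.end);
        const std::uint32_t bb = ucscBin(b.start, b.end);
        return std::tie(a.chrom, ba, a.start) < std::tie(b.chrom, bb, b.start);
    });
    return v;
}
```

With three chr1 intervals [100, 200), [131000, 131200) and [300000, 300100), this sketch orders them by start as 131000, 100, 300000: the second interval crosses a 128 kb boundary, so it lands in a larger bin (id 73) that sorts before the finest-level bins (ids 585 and 587).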

Rationale for PR:

  1. Unix-way: given that a separate tool for sorting, SortBed, exists, other tools in the package should not re-sort BED files during processing.
  2. Control over execution: if other tools re-sort files anyway, there should be a way to change or disable the sorting behavior, but there is no such way for AnnotateBed.
  3. The enforced mode is not the common one: AnnotateBed currently enforces sorting by chrom -> genomic bin -> position. To my knowledge, this model is not widely accepted, and chrom -> position is used much more commonly. In fact, the above-mentioned SortBed from the same package does NOT support such a model, which probably means it is not a reasonable way to order lines in an output file. Thus, the current AnnotateBed processing introduces a confusing re-ordering of input lines for no particular reason or benefit.

There were several ways to improve the situation:
a. allow the current sorting to be switched off with an additional parameter.
b. preserve the initial line order instead of applying any sorting (including the current one).
c. allow an arbitrary sorting mode to be selected.
Solutions (a) and (c) are subject to the problem described in rationale (1): it is better to chain SortBed + AnnotateBed if sorting is required.
Solution (a) is more complex than (b). Additionally, given rationale (3), it seems unlikely that keeping the current sorting would be beneficial. So (a) would introduce unnecessary complexity.
Thus, this PR implements solution (b).
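
As a rough illustration of option (b), and not the code in this commit, the idea is to keep records in a plain sequence in read order and attach results by position, so nothing is ever re-sorted. The `Interval` struct and `countOverlaps` function below are hypothetical names, and the naive nested loop stands in for whatever overlap search is actually used.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record type for illustration; not the bedtools BED struct.
struct Interval {
    std::string chrom;
    std::uint32_t start;  // 0-based, inclusive
    std::uint32_t end;    // exclusive
};

// Returns one overlap count per input interval. counts[i] always belongs to
// input[i], so writing the results in index order reproduces the input order.
std::vector<std::size_t> countOverlaps(const std::vector<Interval>& input,
                                       const std::vector<Interval>& annotations) {
    std::vector<std::size_t> counts;
    counts.reserve(input.size());
    for (const Interval& in : input) {
        std::size_t n = 0;
        for (const Interval& an : annotations) {
            // Half-open intervals overlap when each starts before the other ends.
            if (in.chrom == an.chrom && in.start < an.end && an.start < in.end) {
                ++n;
            }
        }
        counts.push_back(n);
    }
    return counts;
}
```

An unsorted input such as chr2 followed by chr1 comes back with its counts in that same chr2, chr1 order; a user who wants sorted output can sort the input beforehand.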

2. Adjusted comments in bedFile.h to better represent the actual binning model used for the last several years.

Previously, AnnotateBed used genomic binning. As a standalone tool, it never really used the bins or benefited from them. However, the mere use of binning forced AnnotateBed output to be re-sorted within chromosomes: first from larger to smaller bins, then by start coordinate within a bin. This changes the output file in ways the documentation does not lead one to expect. This commit alters annotateBed and the functions specific to it so that the new version of the code preserves the line order of the input file.