Change default behavior of BedAnnotate to preserve input line order #644
This commit alters BedAnnotate and the functions specific to it so that the output file preserves the line order of the input file. Currently, output lines are always sorted. This behavior cannot be configured, and the data are forced into an unusual order that is not supported even by SortBed, the tool specialized for sorting.
Detailed description:
Previously, AnnotateBed used genomic binning. As a standalone tool, it never made any actual use of the bins. However, the mere use of binning forced AnnotateBed to re-sort input lines: first by chromosome, then from larger to smaller bins, and finally by start coordinate within a bin. This changes the output file in ways the documentation does not lead users to expect; see issue #622 (although the issue itself misidentifies the cause of the problem).
Specifically, the output creates the illusion that AnnotateBed sorts data in chrom:pos order. This was never mentioned in the documentation, and there was no way to change this behavior. The actual sorting algorithm, however, is more complicated and involves genomic binning (see below and the commit). The result is a seemingly arbitrary ordering of output data that cannot be switched off.
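To illustrate why binned output looks almost, but not quite, sorted by position, here is a minimal Python sketch of the chromosome / bin / start ordering. It assumes the standard UCSC-style binning constants (five levels, first shift 17, next shift 3); the constants used by the actual code may differ, and `ucsc_bin` and `binned_order` are hypothetical names used only for this example.

```python
# Assumed UCSC-style binning constants (not taken from the commit).
BIN_OFFSETS = [512 + 64 + 8 + 1, 64 + 8 + 1, 8 + 1, 1, 0]
FIRST_SHIFT = 17
NEXT_SHIFT = 3


def ucsc_bin(start, end):
    """Return the smallest bin that fully contains the half-open interval [start, end)."""
    start_bin = start >> FIRST_SHIFT
    end_bin = (end - 1) >> FIRST_SHIFT
    for offset in BIN_OFFSETS:
        if start_bin == end_bin:
            return offset + start_bin
        start_bin >>= NEXT_SHIFT
        end_bin >>= NEXT_SHIFT
    raise ValueError("interval is out of range for this binning scheme")


def binned_order(records):
    """Sort (chrom, start, end) records by chromosome, then bin number, then start.

    Lower bin numbers correspond to larger bins, so ascending bin number
    means 'from larger to smaller bins'.
    """
    return sorted(records, key=lambda r: (r[0], ucsc_bin(r[1], r[2]), r[1]))


records = [
    ("chr1", 150000, 150100),
    ("chr1", 100, 200),
    ("chr1", 50, 300000),   # long interval, lands in a larger bin
    ("chr1", 10, 20),
]
starts = [r[1] for r in binned_order(records)]
print(starts)  # [50, 10, 100, 150000]
```

The long interval starting at 50 is emitted before the short one starting at 10, because it falls into a larger bin. Input made only of short features would come out looking position-sorted, which is consistent with the illusion described above.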
Rationale for PR:
There were several ways to improve the situation:
a. allow the current sorting to be switched off via an additional parameter.
b. preserve the initial line order instead of applying any sorting (including the current one).
c. allow an arbitrary sorting mode to be selected.
Solutions (a) and (c) are subject to the problem described in rationale (1): if sorting is required, it is better to chain SortBed + AnnotateBed.
Solution (a) is more complex than (b). Additionally, due to rationale (3), it seems unlikely that keeping the current sorting would be beneficial, so (a) would introduce unnecessary complexity.
Thus, this PR implements solution (b).
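As a rough illustration of the idea behind (b), and not the code in this commit, the Python sketch below shows one generic way to keep input order even when records are processed grouped by chromosome: remember each record's input index and write results back to that position. The function name `annotate_in_input_order` and the `annotate` callback are hypothetical.

```python
from collections import defaultdict


def annotate_in_input_order(records, annotate):
    """Process records grouped by chromosome, but return results in input order."""
    by_chrom = defaultdict(list)
    for index, rec in enumerate(records):
        by_chrom[rec[0]].append((index, rec))  # remember the input line index

    results = [None] * len(records)
    for chrom in sorted(by_chrom):  # internal processing order does not matter
        for index, rec in by_chrom[chrom]:
            results[index] = (rec, annotate(rec))  # write back to the input slot
    return results


records = [("chr2", 500, 600), ("chr1", 900, 950), ("chr2", 10, 20)]
# Placeholder annotation: the feature length.
annotated = annotate_in_input_order(records, lambda rec: rec[2] - rec[1])
for rec, value in annotated:
    print(*rec, value, sep="\t")
```

The output lists chr2, chr1, chr2 exactly as in the input. Users who want sorted output can still sort the input beforehand.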