GFF.parse modifies input order of base_dict

The function `GFF.parse` has the option `base_dict`, a dictionary of `SeqRecord` objects to which GFF entries are added upon parsing. If `base_dict` is an `OrderedDict` (or a plain `dict`, which preserves insertion order since Python 3.7), the input order is not preserved in the output due to code in the function [parse_in_parts](https://github.com/chapmanb/bcbb/blob/9c6d83ee3f0491f647a9ecd5947b13c99b478f26/gff/BCBio/GFF/GFFParser.py#L321):

```
    def parse_in_parts(self, gff_files, base_dict=None, limit_info=None,
            target_lines=None):
        """Parse a region of a GFF file specified, returning info as generated.

        target_lines -- The number of lines in the file which should be used
        for each partial parse. This should be determined based on available
        memory.
        """
        for results in self.parse_simple(gff_files, limit_info, target_lines):
            if base_dict is None:
                cur_dict = dict()
            else:
                cur_dict = copy.deepcopy(base_dict)
            cur_dict = self._results_to_features(cur_dict, results)
            all_ids = list(cur_dict.keys())
            all_ids.sort()
            for cur_id in all_ids:
                yield cur_dict[cur_id]
```

The statement `all_ids.sort()` sorts the keys, so records are yielded in sorted key order rather than in the insertion order of `base_dict`. Is this sorting necessary? If so, would it be possible to add an option such as `preserve_order` to `GFF.parse` so that callers can turn it off?
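To make the request concrete, here is a minimal sketch of the behaviour and of one possible fix. It does not use BCBio at all: `yield_records` only mimics the last lines of `parse_in_parts`, the `preserve_order` flag is hypothetical (it is not part of the current API), and `restore_order` is a hypothetical caller-side workaround that assumes each record exposes its key as `.id`, as `SeqRecord` does.

```python
from collections import OrderedDict
from types import SimpleNamespace


def yield_records(cur_dict, preserve_order=False):
    """Mimic the tail of parse_in_parts.

    preserve_order is a hypothetical flag, not an existing BCBio.GFF option.
    """
    all_ids = list(cur_dict.keys())
    if not preserve_order:
        all_ids.sort()  # current behaviour: keys are yielded in sorted order
    for cur_id in all_ids:
        yield cur_dict[cur_id]


def restore_order(records, base_dict):
    """Hypothetical workaround: reorder parsed records to match base_dict.

    Records whose id is not in base_dict are placed at the end.
    """
    position = {key: i for i, key in enumerate(base_dict)}
    return sorted(records, key=lambda rec: position.get(rec.id, len(position)))


# Stand-ins for SeqRecord objects, keyed by sequence id in input order.
base_dict = OrderedDict(
    (seq_id, SimpleNamespace(id=seq_id)) for seq_id in ["chr2", "chr10", "chr1"]
)

sorted_out = list(yield_records(base_dict))                     # order is lost
kept_out = list(yield_records(base_dict, preserve_order=True))  # order is kept
restored = restore_order(sorted_out, base_dict)                 # order recovered
```

With the flag off, the ids come out as `chr1`, `chr10`, `chr2`; with it on, or after `restore_order`, they follow the input order `chr2`, `chr10`, `chr1`.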

We are using this function in a package for generating annotated genome assembly files; see https://github.com/NBISweden/EMBLmyGFF3/issues/83.

