Lotus v3.0 annotations with extra codons #51

terrymun · 2018-01-09T01:49:22Z

Submitter: Robert Syme
Email: [email protected]
The v3.0 annotations (gff) contain 9835 annotations that have an extra codon included after the stop codon. For example, the protein translated from Lj0g3v0000709.1 is encoded in the gff file like so:

chr0    .       gene    300849  302392  .       .       .       ID=Lj0g3v0000709;Name=CUFF.74;sequencetype=Protein coding
chr0    .       mRNA    300849  302392  .       +       .       ID=Lj0g3v0000709.1;Parent=Lj0g3v0000709;Name=Lj0g3v0000709.1;sequencetype=Protein coding;annotation=hypothetical protein SPAPADRAFT_64676 [Spathaspora passalidarum NRRL Y-27907] gi|344305338|gb|EGW35570.1|
chr0    .       exon    300849  301357  .       +       .       ID=Lj0g3v0000709.1.exon.1;Parent=Lj0g3v0000709.1;sequencetype=Protein coding
chr0    .       exon    302170  302392  .       +       .       ID=Lj0g3v0000709.1.exon.2;Parent=Lj0g3v0000709.1;sequencetype=Protein coding
chr0    .       CDS     300970  301357  .       +       0       ID=Lj0g3v0000709.1.CDS.1;Parent=Lj0g3v0000709.1;sequencetype=Protein coding
chr0    .       CDS     302170  302183  .       +       2       ID=Lj0g3v0000709.1.CDS.2;Parent=Lj0g3v0000709.1;sequencetype=Protein coding

The CDS is split across two exons (388 bp and 14 bp) for a total of 402 bp, or 134 codons. The translation ends with a stop codon followed by one extra amino acid:

>Lj0g3v0000709.1
MSQIFFLVAATTCHRSFSSSPPFLLISSHHHHNNQGANTTSPYIMFFFLLQSKTNHHCPFFSFSSLWPQKEQHPHAPHEP
PPSRFLLLHGWPNAPQTSVLLPHVAPLDSHDGHQSAPLRTTITFIFQFYGYE*R
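
The length arithmetic above can be re-derived from the two CDS rows. The following is a minimal Python sketch, not part of the original report: the helper names `cds_segments` and `expected_phases` are hypothetical, the attribute column is abbreviated, and the phase check assumes a plus-strand transcript whose first CDS segment starts in frame.

```python
# The two CDS rows for Lj0g3v0000709.1, copied from the GFF excerpt above
# (attribute column shortened; columns are tab-separated as in GFF3).
GFF_LINES = """\
chr0\t.\tCDS\t300970\t301357\t.\t+\t0\tID=Lj0g3v0000709.1.CDS.1;Parent=Lj0g3v0000709.1
chr0\t.\tCDS\t302170\t302183\t.\t+\t2\tID=Lj0g3v0000709.1.CDS.2;Parent=Lj0g3v0000709.1
"""

def cds_segments(gff_text):
    """Return sorted (start, end, phase) tuples for CDS rows (1-based, inclusive)."""
    segments = []
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if cols[2] == "CDS":
            segments.append((int(cols[3]), int(cols[4]), int(cols[7])))
    return sorted(segments)

def expected_phases(segments):
    """Phase each segment should carry, assuming + strand and an in-frame start."""
    phases, consumed = [], 0
    for start, end, _ in segments:
        # Bases still needed to complete the codon left open by earlier segments.
        phases.append((3 - consumed % 3) % 3)
        consumed += end - start + 1
    return phases

segments = cds_segments(GFF_LINES)
lengths = [end - start + 1 for start, end, _ in segments]
total = sum(lengths)
print(lengths, total, total // 3)   # [388, 14] 402 134
print(expected_phases(segments))    # [0, 2]
```

The recorded phases (0 and 2) agree with the expected ones, so the segment boundaries are internally consistent; the only oddity is that the 134 codons include one codon beyond the stop.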

Is the extra amino acid deliberately included in the CDS feature? Should these 9835 proteins with a similar extra codon be included in comparative analysis?

Similarly, there seem to be a number of CDS features that contain premature stop codons. For example:

chr0    .       gene    8669307 8670996 .       .       .       ID=Lj0g3v0021349;Name=CUFF.1180;sequencetype=Protein coding
chr0    .       mRNA    8669359 8670996 .       +       .       ID=Lj0g3v0021349.2;Parent=Lj0g3v0021349;Name=Lj0g3v0021349.2;sequencetype=Protein coding;annotation=NoHit
chr0    .       exon    8669359 8669413 .       +       .       ID=Lj0g3v0021349.2.exon.1;Parent=Lj0g3v0021349.2;sequencetype=Protein coding
chr0    .       exon    8669509 8669584 .       +       .       ID=Lj0g3v0021349.2.exon.2;Parent=Lj0g3v0021349.2;sequencetype=Protein coding
chr0    .       exon    8669905 8670079 .       +       .       ID=Lj0g3v0021349.2.exon.3;Parent=Lj0g3v0021349.2;sequencetype=Protein coding
chr0    .       exon    8670485 8670658 .       +       .       ID=Lj0g3v0021349.2.exon.4;Parent=Lj0g3v0021349.2;sequencetype=Protein coding
chr0    .       exon    8670754 8670856 .       +       .       ID=Lj0g3v0021349.2.exon.5;Parent=Lj0g3v0021349.2;sequencetype=Protein coding
chr0    .       exon    8670937 8670996 .       +       .       ID=Lj0g3v0021349.2.exon.6;Parent=Lj0g3v0021349.2;sequencetype=Protein coding
chr0    .       CDS     8669381 8669413 .       +       0       ID=Lj0g3v0021349.2.CDS.1;Parent=Lj0g3v0021349.2;sequencetype=Protein coding
chr0    .       CDS     8669509 8669584 .       +       0       ID=Lj0g3v0021349.2.CDS.2;Parent=Lj0g3v0021349.2;sequencetype=Protein coding
chr0    .       CDS     8669905 8669933 .       +       2       ID=Lj0g3v0021349.2.CDS.3;Parent=Lj0g3v0021349.2;sequencetype=Protein coding

Which translates to the protein:

>Lj0g3v0021349.2
MRMGVDNMSYEELLALGERIGHVNTGLSEDSLTKKQ*ADFIQGY*I
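
Both symptoms (residues after the stop, and stops inside the protein) can be located mechanically in the translated sequences. The sketch below is only an illustration: `stop_report` is a hypothetical helper, and it assumes that the last `*` in each translation is the intended terminator, which the thread does not confirm.

```python
def stop_report(protein):
    """Summarise where '*' occurs in a translated CDS (0-based residue indices)."""
    stops = [i for i, aa in enumerate(protein) if aa == "*"]
    if not stops:
        return {"stops": [], "internal_stops": [], "trailing_residues": None}
    return {
        "stops": stops,
        # Every stop before the last one is treated as premature (assumption).
        "internal_stops": stops[:-1],
        # Residues translated after the last stop codon.
        "trailing_residues": len(protein) - 1 - stops[-1],
    }

# Translations copied from the two FASTA records quoted in this report.
protein_709 = (
    "MSQIFFLVAATTCHRSFSSSPPFLLISSHHHHNNQGANTTSPYIMFFFLLQSKTNHHCPFFSFSSLWPQKEQHPHAPHEP"
    "PPSRFLLLHGWPNAPQTSVLLPHVAPLDSHDGHQSAPLRTTITFIFQFYGYE*R"
)
protein_21349 = "MRMGVDNMSYEELLALGERIGHVNTGLSEDSLTKKQ*ADFIQGY*I"

report_709 = stop_report(protein_709)
report_21349 = stop_report(protein_21349)
print(report_709["internal_stops"], report_709["trailing_residues"])
print(report_21349)
# {'stops': [36, 44], 'internal_stops': [36], 'trailing_residues': 1}
```

Under that assumption, Lj0g3v0021349.2 shows both problems at once: one internal stop and one residue after the final stop.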

Are these genes to be translated with an alternative codon table?

Sorry to bother you, and I hope that I've not misunderstood the annotation gff.

Rob Syme

robsyme · 2018-01-09T02:06:40Z

Taking Lj0g3v0021349.2 as an example, the CDS sequence obtained by concatenating the CDS features from the genome does not match the provided CDS sequence: the stop codons in the genome sequence have been replaced by other codons in the CDS sequence, and the CDS sequence is three base pairs shorter.

Again, apologies if I'm doing something wrong at my end.
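
The comparison described in this comment can be sketched as follows. This is a hedged illustration only: the chromosome string, coordinates, and reference CDS are invented toy data chosen to mimic the reported pattern (one stop codon replaced, three extra bases at the end), and `extract_cds` and `compare_cds` are hypothetical helper names. A real check would load the genome and the CDS FASTA instead.

```python
def extract_cds(chrom_seq, segments, strand="+"):
    """Concatenate 1-based inclusive segments; reverse-complement on the minus strand."""
    seq = "".join(chrom_seq[start - 1:end] for start, end in sorted(segments))
    if strand == "-":
        seq = seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]
    return seq

def compare_cds(derived, reference):
    """Report the length difference and the codons that differ over the shared length."""
    shared = min(len(derived), len(reference))
    differing = [
        i // 3
        for i in range(0, shared - shared % 3, 3)
        if derived[i:i + 3].upper() != reference[i:i + 3].upper()
    ]
    return {
        "length_difference": len(derived) - len(reference),
        "differing_codons": differing,
    }

# Toy data (not real Lotus sequence): two CDS segments separated by an intron.
toy_chrom = "CC" + "ATGGCT" + "GTAAGT" + "TAAGGGTGACGT" + "AA"
toy_segments = [(3, 8), (15, 26)]
toy_reference = "ATGGCTCAAGGGTGA"  # ATG GCT CAA GGG TGA

derived = extract_cds(toy_chrom, toy_segments)
result = compare_cds(derived, toy_reference)
print(derived)  # ATGGCTTAAGGGTGACGT
print(result)   # {'length_difference': 3, 'differing_codons': [2]}
```

In the toy case the GFF-derived sequence is three bases longer than the reference and differs at one codon, where the genome has a stop (TAA) and the reference has CAA.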

terrymun · 2018-01-10T07:44:01Z

@robsyme Thanks for bringing that to our attention. I can confirm that the GFF3 coordinates are indeed incorrect, as far as our additional internal checks go. We are currently trying to trace the source of the error and will rectify it as soon as possible.

Meanwhile, the CDS data in the FASTA files (accessible via the SeqRet toolkit and downloadable here (name: Lotus japonicus v3.0 CDS)) are known to be correct; I spot-checked a few sequences there and they look right. You might want to use these sequences instead of those inferred from the GFF3 coordinates for now.

robsyme · 2018-01-10T07:56:45Z

Thanks Terry

For our analyses, we need to know the genomic position of the coding sequence, so we might need to wait until the GFF is fixed. Thanks though!

I can supply a list of affected loci if that would be helpful.

terrymun · 2018-01-12T14:31:49Z

@robsyme Hi Robert, if you can provide a list of affected loci, that will be extremely helpful. Can you send it to my work email, at [email protected]? Many thanks.

We are currently performing an internal data audit and checking old logs (the file was last generated in 2014) to see what could have gone wrong.

terrymun changed the title ~~New issue filed~~ Lotus v3.0 annotations with extra codons Jan 9, 2018

terrymun self-assigned this Jan 10, 2018
