Lotus v3.0 annotations with extra codons #51

Open
terrymun opened this issue Jan 9, 2018 · 4 comments
@terrymun
Member

terrymun commented Jan 9, 2018

Submitter: Robert Syme
Email: [email protected]
The v3.0 annotations (gff) contain 9835 annotations that have an extra codon included after the stop codon. For example, the protein translated from Lj0g3v0000709.1 is encoded in the gff file like so:

chr0    .       gene    300849  302392  .       .       .       ID=Lj0g3v0000709;Name=CUFF.74;sequencetype=Protein coding
chr0    .       mRNA    300849  302392  .       +       .       ID=Lj0g3v0000709.1;Parent=Lj0g3v0000709;Name=Lj0g3v0000709.1;sequencetype=Protein coding;annotation=hypothetical protein SPAPADRAFT_64676 [Spathaspora passalidarum NRRL Y-27907] gi|344305338|gb|EGW35570.1|
chr0    .       exon    300849  301357  .       +       .       ID=Lj0g3v0000709.1.exon.1;Parent=Lj0g3v0000709.1;sequencetype=Protein coding
chr0    .       exon    302170  302392  .       +       .       ID=Lj0g3v0000709.1.exon.2;Parent=Lj0g3v0000709.1;sequencetype=Protein coding
chr0    .       CDS     300970  301357  .       +       0       ID=Lj0g3v0000709.1.CDS.1;Parent=Lj0g3v0000709.1;sequencetype=Protein coding
chr0    .       CDS     302170  302183  .       +       2       ID=Lj0g3v0000709.1.CDS.2;Parent=Lj0g3v0000709.1;sequencetype=Protein coding

The CDS is split across two exons (388 bp and 14 bp) for a total of 402 bp, or 134 codons. When translated, these give 132 amino acids, a stop codon, and one further residue:

>Lj0g3v0000709.1
MSQIFFLVAATTCHRSFSSSPPFLLISSHHHHNNQGANTTSPYIMFFFLLQSKTNHHCPFFSFSSLWPQKEQHPHAPHEP
PPSRFLLLHGWPNAPQTSVLLPHVAPLDSHDGHQSAPLRTTITFIFQFYGYE*R
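
The length arithmetic can be checked directly from the CDS rows above. This is a minimal sketch, assuming plus-strand features sorted by start and a first segment in phase 0; the function names are hypothetical and not part of any Lotus Base tooling.

```python
# GFF3 coordinates are 1-based and inclusive, so length = end - start + 1.
# (start, end, phase) copied from the two CDS rows of Lj0g3v0000709.1.
cds_rows = [
    (300970, 301357, 0),
    (302170, 302183, 2),
]

def cds_lengths(rows):
    """Return the length in bp of each CDS segment."""
    return [end - start + 1 for start, end, _phase in rows]

def expected_phases(rows):
    """Recompute each segment's phase from the bases that precede it.

    Assumes plus-strand segments sorted by start, with the first in phase 0.
    """
    phases, total = [], 0
    for start, end, _phase in rows:
        phases.append((3 - total % 3) % 3)
        total += end - start + 1
    return phases

lengths = cds_lengths(cds_rows)
total_bp = sum(lengths)
print(lengths, total_bp, total_bp // 3, total_bp % 3)
print(expected_phases(cds_rows) == [phase for _s, _e, phase in cds_rows])
```

The total is a whole number of codons and the phase column agrees with the segment lengths, so the rows are internally consistent; the question is only whether the last codon belongs in the CDS.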

Is the extra amino acid deliberately included in the CDS feature? Should these 9835 proteins with a similar extra codon be included in comparative analysis?

Similarly, there seem to be a number of CDS features that contain premature stop codons. For example:

chr0    .       gene    8669307 8670996 .       .       .       ID=Lj0g3v0021349;Name=CUFF.1180;sequencetype=Protein coding
chr0    .       mRNA    8669359 8670996 .       +       .       ID=Lj0g3v0021349.2;Parent=Lj0g3v0021349;Name=Lj0g3v0021349.2;sequencetype=Protein coding;annotation=NoHit
chr0    .       exon    8669359 8669413 .       +       .       ID=Lj0g3v0021349.2.exon.1;Parent=Lj0g3v0021349.2;sequencetype=Protein coding
chr0    .       exon    8669509 8669584 .       +       .       ID=Lj0g3v0021349.2.exon.2;Parent=Lj0g3v0021349.2;sequencetype=Protein coding
chr0    .       exon    8669905 8670079 .       +       .       ID=Lj0g3v0021349.2.exon.3;Parent=Lj0g3v0021349.2;sequencetype=Protein coding
chr0    .       exon    8670485 8670658 .       +       .       ID=Lj0g3v0021349.2.exon.4;Parent=Lj0g3v0021349.2;sequencetype=Protein coding
chr0    .       exon    8670754 8670856 .       +       .       ID=Lj0g3v0021349.2.exon.5;Parent=Lj0g3v0021349.2;sequencetype=Protein coding
chr0    .       exon    8670937 8670996 .       +       .       ID=Lj0g3v0021349.2.exon.6;Parent=Lj0g3v0021349.2;sequencetype=Protein coding
chr0    .       CDS     8669381 8669413 .       +       0       ID=Lj0g3v0021349.2.CDS.1;Parent=Lj0g3v0021349.2;sequencetype=Protein coding
chr0    .       CDS     8669509 8669584 .       +       0       ID=Lj0g3v0021349.2.CDS.2;Parent=Lj0g3v0021349.2;sequencetype=Protein coding
chr0    .       CDS     8669905 8669933 .       +       2       ID=Lj0g3v0021349.2.CDS.3;Parent=Lj0g3v0021349.2;sequencetype=Protein coding

Which translates to the protein:

>Lj0g3v0021349.2
MRMGVDNMSYEELLALGERIGHVNTGLSEDSLTKKQ*ADFIQGY*I

Are these genes to be translated with an alternative codon table?
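
One way to separate the two kinds of problem described above is to look at where the `*` characters fall in each translation. The sketch below is illustrative only: the category labels and function names are assumptions, not an established classification.

```python
def stop_positions(protein):
    """Return the 1-based positions of every '*' in a translated sequence."""
    return [i + 1 for i, aa in enumerate(protein) if aa == "*"]

def classify(protein):
    """Label a translation by where its stops fall (labels are illustrative)."""
    stops = stop_positions(protein)
    if not stops:
        return "no_stop"
    if stops == [len(protein)]:
        return "terminal_stop_only"
    if len(stops) == 1:
        return "residues_after_stop"
    return "multiple_stops"

# The two translations quoted in this issue.
lj709 = (
    "MSQIFFLVAATTCHRSFSSSPPFLLISSHHHHNNQGANTTSPYIMFFFLLQSKTNHHCPFFSFSSLWPQKEQHPHAPHEP"
    "PPSRFLLLHGWPNAPQTSVLLPHVAPLDSHDGHQSAPLRTTITFIFQFYGYE*R"
)
lj21349 = "MRMGVDNMSYEELLALGERIGHVNTGLSEDSLTKKQ*ADFIQGY*I"

for name, seq in [("Lj0g3v0000709.1", lj709), ("Lj0g3v0021349.2", lj21349)]:
    print(name, classify(seq), stop_positions(seq))
```

Run over every translated mRNA, a check like this would yield the list of affected loci in each category.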

Sorry to bother you, and I hope that I've not misunderstood the annotation gff.

  • Rob Syme
@robsyme

robsyme commented Jan 9, 2018

Taking Lj0g3v0021349.2 as an example, the CDS sequence obtained by concatenating the CDS features from the genome does not match the released CDS sequence: the stop codons in the genome sequence have been replaced by other codons in the CDS sequence, and the CDS sequence is three base pairs shorter.

Again, apologies if I'm doing something wrong at my end.
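
The comparison described in this comment can be sketched as follows. This is a toy illustration on a made-up sequence, not the Lotus genome; the function names are hypothetical, and a real run would load the chromosome and the reference CDS from FASTA files.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def spliced_cds(genome, segments, strand="+"):
    """Concatenate CDS segments given as 1-based inclusive (start, end) pairs."""
    seq = "".join(genome[start - 1:end] for start, end in sorted(segments))
    if strand == "-":
        # Reverse-complement for minus-strand features.
        seq = seq.translate(COMPLEMENT)[::-1]
    return seq

def first_difference(a, b):
    """Return the 0-based index where two sequences first differ, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))

# Toy data: two CDS segments separated by a 4 bp "intron".
toy_genome = "NNATGAAAGGNNNNCTAANN"
from_gff = spliced_cds(toy_genome, [(3, 10), (15, 18)])
reference = "ATGAAAGGC"  # hypothetical reference CDS, 3 bp shorter

print(from_gff, len(from_gff) - len(reference), first_difference(from_gff, reference))
```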

terrymun changed the title from "New issue filed" to "Lotus v3.0 annotations with extra codons" on Jan 9, 2018
terrymun self-assigned this on Jan 10, 2018
@terrymun
Member Author

@robsyme Thanks for bringing that to our attention. I can confirm that the GFF3 coordinates are indeed incorrect, as far as our additional internal checks go. We are currently trying to trace the source of the error and will rectify it as soon as possible.

Meanwhile, the CDS data in FASTA files (accessible via the SeqRet toolkit and downloadable here (name: Lotus japonicus v3.0 CDS)) are known to be correct; I spot-checked a few sequences there and they look right. For now, you might want to use these sequences instead of those inferred from the GFF3 coordinates.

@robsyme

robsyme commented Jan 10, 2018

Thanks Terry

For our analyses, we need to know the genomic position of the coding sequence, so we might need to wait until the GFF is fixed. Thanks though!

I can supply a list of affected loci if that would be helpful.
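
To relate a position in a translated sequence back to the genome, an offset in the spliced CDS has to be mapped across the segment boundaries. The sketch below assumes plus-strand features and takes the GFF coordinates at face value; the function name is hypothetical.

```python
def cds_offset_to_genomic(segments, offset):
    """Map a 0-based offset in the spliced CDS to a 1-based genomic position.

    segments: 1-based inclusive (start, end) pairs on the plus strand.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    for start, end in sorted(segments):
        length = end - start + 1
        if offset < length:
            return start + offset
        offset -= length
    raise ValueError("offset lies beyond the end of the CDS")

# CDS segments of Lj0g3v0000709.1 as listed in the GFF.
segments = [(300970, 301357), (302170, 302183)]

# The '*' is the 133rd codon of the translation, so its first base
# sits at CDS offset (133 - 1) * 3 = 396.
stop_start = cds_offset_to_genomic(segments, (133 - 1) * 3)
print(stop_start, stop_start + 2)
```

Under these assumptions the stop codon of Lj0g3v0000709.1 falls in the second CDS segment, three bases before its annotated end. This only locates the codon in the published coordinates; it says nothing about what the corrected coordinates should be.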

@terrymun
Member Author

@robsyme Hi Robert, if you can provide a list of affected loci, that would be extremely helpful. Can you send it to my work email, at [email protected]? Many thanks.

We are currently performing an internal data audit and checking old logs (the file was last generated in 2014) to see what could have gone wrong.
