How to create a pigeon‐compatible annotation GTF
Last updated: 09/29/2023
Please use the latest pigeon version, which also includes pigeon prepare, to help validate the correctness of custom annotation GTFs and reference genomes!
Pigeon is designed to work with Gencode annotation GTF files. Other GTF formats will need to be modified to work with pigeon classify.
- pigeon GTF format requirements
- pigeon GTF examples
- Using pigeon prepare to check genomes and annotations
The pigeon GTF format requirements are:
A tab-delimited, 9-column file in GFF/GTF file format, where:
- Column 1 must be the chromosome
- Column 2 is ignored
- Column 3 will only be processed if it is gene, transcript, or exon. All other types (e.g. CDS) are ignored.
- Columns 4 and 5 are the 1-based start and end coordinates
- Columns 6 and 8 are ignored
- Column 7 is the strand, which must be + or -
- Column 9 is the attribute column, a free-text string. To be processed properly, it must contain at least the following fields, separated by semicolons. Ex: gene_id "ENSG0001"; transcript_id "ENST000A"; gene_name "TP53";
- No extra blank lines at the beginning or end of the file
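The rules above can be sketched as a per-line check. This is only an illustration and not how pigeon itself validates files (pigeon prepare is the supported way to do that). The function name check_gtf_line is hypothetical. The sketch also makes three assumptions that the list above does not state: lines starting with # are skipped as header comments, start must not exceed end, and transcript_id is not required on "gene" rows (the Gencode snippet further down has "gene" rows without it).

```python
import re

PROCESSED_FEATURES = {"gene", "transcript", "exon"}
# Matches attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def check_gtf_line(line):
    """Return a list of problem descriptions for one GTF line (empty if none)."""
    if not line.strip():
        return ["blank line"]
    if line.startswith("#"):
        return []  # assumption: '#' header/comment lines are skipped
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return [f"expected 9 tab-delimited columns, found {len(cols)}"]
    chrom, _source, feature, start, end, _score, strand, _frame, attrs = cols
    if feature not in PROCESSED_FEATURES:
        return []  # other feature types (e.g. CDS) are ignored
    problems = []
    if not chrom:
        problems.append("column 1 (chromosome) is empty")
    if not (start.isdecimal() and end.isdecimal()):
        problems.append("columns 4 and 5 must be integers")
    elif int(start) < 1 or int(start) > int(end):
        # assumption: start <= end, as in the standard GTF convention
        problems.append("columns 4 and 5 must satisfy 1 <= start <= end")
    if strand not in ("+", "-"):
        problems.append("column 7 (strand) must be + or -")
    keys = {key for key, _value in ATTR_RE.findall(attrs)}
    required = ["gene_id", "gene_name"]
    if feature != "gene":
        required.append("transcript_id")  # assumption: not required on gene rows
    for key in required:
        if key not in keys:
            problems.append(f"column 9 is missing {key}")
    return problems
```

Running it over every line of a file and printing the line number next to any non-empty result gives a quick first pass before handing the file to pigeon prepare.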
An isoform record is a single "gene" record followed by one or more "transcript" records. Each "transcript" record is followed by one or more "exon" records. "Gene" records are considered only during pigeon prepare, to check for unique IDs. During pigeon classify, only "transcript" records are considered, for both collapsed isoforms and annotations. pigeon uses each "transcript" entry to trigger the next batch and reads the next 1..N exons as children of it.
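The batching behaviour described above can be illustrated with a short sketch. It is not pigeon's implementation. The function name group_transcripts and the dictionary layout are hypothetical, and the sketch assumes a well-formed, tab-delimited file in which exons directly follow their transcript.

```python
import re

TRANSCRIPT_ID_RE = re.compile(r'transcript_id "([^"]+)"')


def group_transcripts(lines):
    """Group exon rows under the most recent transcript row."""
    transcripts = []
    current = None
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # malformed rows are skipped in this sketch
        feature = cols[2]
        if feature == "transcript":
            # A transcript row starts a new batch.
            match = TRANSCRIPT_ID_RE.search(cols[8])
            current = {
                "transcript_id": match.group(1) if match else None,
                "chrom": cols[0],
                "strand": cols[6],
                "exons": [],
            }
            transcripts.append(current)
        elif feature == "exon" and current is not None:
            # Exon rows become children of the current transcript.
            current["exons"].append((int(cols[3]), int(cols[4])))
        # "gene" rows and all other feature types are not used here.
    return transcripts
```

A consequence of this reading order is that an exon placed before any "transcript" row has no parent to attach to, which is why the record order matters.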
Below is a snippet of a Gencode annotation as a reference:
chr1 ENSEMBL gene 17369 17436 . - . gene_id "ENSG00000278267.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; level 3;
chr1 ENSEMBL transcript 17369 17436 . - . gene_id "ENSG00000278267.1"; transcript_id "ENST00000619216.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; transcript_type "miRNA"; transcript_status "KNOWN"; transcript_name "MIR6859-1-201"; level 3; tag "basic"; transcript_support_level "NA";
chr1 ENSEMBL exon 17369 17436 . - . gene_id "ENSG00000278267.1"; transcript_id "ENST00000619216.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; transcript_type "miRNA"; transcript_status "KNOWN"; transcript_name "MIR6859-1-201"; exon_number 1; exon_id "ENSE00003746039.1"; level 3; tag "basic"; transcript_support_level "NA";
chr1 HAVANA gene 29554 31109 . + . gene_id "ENSG00000243485.3"; gene_type "lincRNA"; gene_status "KNOWN"; gene_name "RP11-34P13.3"; level 2; tag "ncRNA_host"; havana_gene "OTTHUMG00000000959.2";
chr1 HAVANA transcript 29554 31097 . + . gene_id "ENSG00000243485.3"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_status "KNOWN"; gene_name "RP11-34P13.3"; transcript_type "lincRNA"; transcript_status "KNOWN"; transcript_name "RP11-34P13.3-001"; level 2; tag "not_best_in_genome_evidence"; tag "basic"; transcript_support_level "5"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
chr1 HAVANA exon 29554 30039 . + . gene_id "ENSG00000243485.3"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_status "KNOWN"; gene_name "RP11-34P13.3"; transcript_type "lincRNA"; transcript_status "KNOWN"; transcript_name "RP11-34P13.3-001"; exon_number 1; exon_id "ENSE00001947070.1"; level 2; tag "not_best_in_genome_evidence"; tag "basic"; transcript_support_level "5"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
chr1 HAVANA exon 30564 30667 . + . gene_id "ENSG00000243485.3"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_status "KNOWN"; gene_name "RP11-34P13.3"; transcript_type "lincRNA"; transcript_status "KNOWN"; transcript_name "RP11-34P13.3-001"; exon_number 2; exon_id "ENSE00001922571.1"; level 2; tag "not_best_in_genome_evidence"; tag "basic"; transcript_support_level "5"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
chr1 HAVANA exon 30976 31097 . + . gene_id "ENSG00000243485.3"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_status "KNOWN"; gene_name "RP11-34P13.3"; transcript_type "lincRNA"; transcript_status "KNOWN"; transcript_name "RP11-34P13.3-001"; exon_number 3; exon_id "ENSE00001827679.1"; level 2; tag "not_best_in_genome_evidence"; tag "basic"; transcript_support_level "5"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
Here is an example of a pigeon-compatible annotation after it's been manually modified.
Pf3D7_13_v3 VEuPathDB gene 21364 28787 . + . gene_id "PF3D7_1300100"; transcript_id "PF3D7_1300100.1"; gene_name "PF3D7_1300100"; transcript_name "PF3D7_1300100.1"; biotype "test";
Pf3D7_13_v3 VEuPathDB transcript 21364 28787 . + . gene_id "PF3D7_1300100"; transcript_id "PF3D7_1300100.1"; gene_name "PF3D7_1300100"; transcript_name "PF3D7_1300100.1"; biotype "test";
Pf3D7_13_v3 VEuPathDB exon 21364 26538 . + . gene_id "PF3D7_1300100"; transcript_id "PF3D7_1300100.1"; gene_name "PF3D7_1300100"; transcript_name "PF3D7_1300100.1"; biotype "test";
Pf3D7_13_v3 VEuPathDB exon 27474 28787 . + . gene_id "PF3D7_1300100"; transcript_id "PF3D7_1300100.1"; gene_name "PF3D7_1300100"; transcript_name "PF3D7_1300100.1"; biotype "test";
Pf3D7_13_v3 VEuPathDB CDS 21364 26538 . + 0 Parent=PF3D7_1300100.1
Pf3D7_13_v3 VEuPathDB CDS 27474 28787 . + 0 Parent=PF3D7_1300100.1
Pf3D7_13_v3 VEuPathDB gene 30605 31881 . - . gene_id "PF3D7_1300200"; transcript_id "PF3D7_1300200.1"; gene_name "PF3D7_1300200"; transcript_name "PF3D7_1300200.1"; biotype "test";
Pf3D7_13_v3 VEuPathDB transcript 30605 31881 . - . gene_id "PF3D7_1300200"; transcript_id "PF3D7_1300200.1"; gene_name "PF3D7_1300200"; transcript_name "PF3D7_1300200.1"; biotype "test";
Pf3D7_13_v3 VEuPathDB exon 30605 31597 . - . gene_id "PF3D7_1300200"; transcript_id "PF3D7_1300200.1"; gene_name "PF3D7_1300200"; transcript_name "PF3D7_1300200.1"; biotype "test";
Pf3D7_13_v3 VEuPathDB exon 31828 31881 . - . gene_id "PF3D7_1300200"; transcript_id "PF3D7_1300200.1"; gene_name "PF3D7_1300200"; transcript_name "PF3D7_1300200.1"; biotype "test";
Pf3D7_13_v3 VEuPathDB CDS 30605 31597 . - 0 Parent=PF3D7_1300200.1
Pf3D7_13_v3 VEuPathDB CDS 31828 31881 . - 0 Parent=PF3D7_1300200.1
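If the modification is scripted rather than done by hand, records like the ones above could be written with a small helper along these lines. This is a minimal sketch: the function name format_gtf_record is hypothetical, it emits only the minimum attribute fields, and falling back to the gene ID when no gene name is available is an assumption that mirrors the example above.

```python
def format_gtf_record(chrom, source, feature, start, end, strand,
                      gene_id, transcript_id, gene_name=None):
    """Build one tab-delimited GTF line with the minimum attribute fields."""
    if feature not in ("gene", "transcript", "exon"):
        raise ValueError(f"unsupported feature type: {feature}")
    if strand not in ("+", "-"):
        raise ValueError(f"strand must be + or -, got: {strand}")
    # Assumption: reuse the gene ID when no separate gene name exists.
    name = gene_name if gene_name is not None else gene_id
    attrs = (
        f'gene_id "{gene_id}"; '
        f'transcript_id "{transcript_id}"; '
        f'gene_name "{name}";'
    )
    # Columns 6 and 8 are ignored, so "." is written as a placeholder.
    return "\t".join(
        [chrom, source, feature, str(start), str(end), ".", strand, ".", attrs]
    )
```

Writing the output with literal tab characters matters, because the format requires tab-delimited columns and spaces inside column 9 are part of the attribute text.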
Example usages:
$ pigeon prepare annotation.gtf collapsed_isoforms.gff reference.fasta cage.bed
or
$ pigeon prepare reference_files.fofn