From 97516bdc85781aa7bccdb25664a841f095564257 Mon Sep 17 00:00:00 2001
From: Matthieu Muffato
Date: Fri, 24 May 2024 09:34:45 +0000
Subject: [PATCH] Updated the output documentation

---
 docs/output.md | 175 +++++-------------------------------------------
 1 file changed, 15 insertions(+), 160 deletions(-)

diff --git a/docs/output.md b/docs/output.md
index a3deef9..2eae20c 100644
--- a/docs/output.md
+++ b/docs/output.md
@@ -13,8 +13,7 @@ The directories comply with Tree of Life's canonical directory structure.
 
 The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:
 
-- [Gene annotation files](#gene-annotation-files) - Assembly files, either straight from the NCBI FTP, or indices built on them
-- [Repeat annotation files](#repeat-annotation-files) - Files corresponding to analyses run (by the NCBI) on the original assembly, e.g repeat masking
+- [Repeat annotation files](#repeat-annotation-files) - Files corresponding to repeat annotation produced by Ensembl
 - [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution
 
 All data files are compressed (and indexed) with `bgzip`.
@@ -23,170 +22,26 @@ All Fasta files are indexed with `samtools faidx`, which allows accessing any re
 
 All BED files are indexed with tabix in both TBI and CSI modes, unless the sequences are too large.
 
-### Gene annotation files
-
-Here are the files you can expect in the `gene/` sub-directory.
-
-```text
-/lustre/scratch124/tol/projects/darwin/data/insects/Noctua_fimbriata/
-└── analysis
-    └── ilNocFimb1.1
-        └── gene
-            └── braker2
-                ├── GCA_905163415.1.braker2.2022_03.cdna.fa.gz
-                ├── GCA_905163415.1.braker2.2022_03.cdna.fa.gz.dict
-                ├── GCA_905163415.1.braker2.2022_03.cdna.fa.gz.fai
-                ├── GCA_905163415.1.braker2.2022_03.cdna.fa.gz.gzi
-                ├── GCA_905163415.1.braker2.2022_03.cdna.seq_length.tsv
-                ├── GCA_905163415.1.braker2.2022_03.cds.fa.gz
-                ├── GCA_905163415.1.braker2.2022_03.cds.fa.gz.dict
-                ├── GCA_905163415.1.braker2.2022_03.cds.fa.gz.fai
-                ├── GCA_905163415.1.braker2.2022_03.cds.fa.gz.gzi
-                ├── GCA_905163415.1.braker2.2022_03.cds.seq_length.tsv
-                ├── GCA_905163415.1.braker2.2022_03.gff3.gz
-                ├── GCA_905163415.1.braker2.2022_03.gff3.gz.csi
-                ├── GCA_905163415.1.braker2.2022_03.gff3.gz.gzi
-                ├── GCA_905163415.1.braker2.2022_03.pep.fa.gz
-                ├── GCA_905163415.1.braker2.2022_03.pep.fa.gz.dict
-                ├── GCA_905163415.1.braker2.2022_03.pep.fa.gz.fai
-                ├── GCA_905163415.1.braker2.2022_03.pep.fa.gz.gzi
-                └── GCA_905163415.1.braker2.2022_03.pep.seq_length.tsv
-```
-
-The directory structure includes the assembly name, e.g. `fParRan2.2`, and all files are named after the assembly accession, e.g. `GCA_900634625.2`.
-The file name (and the directory name) includes the annotation method and date. Current methods are:
-
-- `braker2` for [BRAKER2](https://academic.oup.com/nargab/article/3/1/lqaa108/6066535)
-- `ensembl` for Ensembl's own annotation pipeline
-
-The `.seq_length.tsv` files are tabular analogous to the common `chrom.sizes`. They contain the sequence names and their lengths.
-
-_The following documentation is copied from Ensembl's FTP_
-
-#### Fasta files
-
-Ensembl provide gene sequences in FASTA format in three files. The 'cdna' file contains
-transcript sequences for all types of gene (including, for example,
-pseudogenes and RNA genes). The 'cds' file contains the DNA sequences
-of the coding regions of protein-coding genes. The 'pep' file contains
-the amino acid sequences of protein-coding genes.
-
-The headers in the 'cdna' FASTA files have the format:
-
-```text
-> :::: gene: gene_biotype: transcript_biotype: [gene_symbol:] [description:]
-```
-
-Example 'cdna' header:
-
-```text
->ENSZVIT00000000002.1 cdna UG_Zviv_1:LG1:3600:22235:-1 gene:ENSZVIG00000000002.1 gene_biotype:protein_coding transcript_biotype:protein_coding
-```
-
-The headers in the 'cds' FASTA files have the format:
-
-```text
-> :::: gene: gene_biotype: transcript_biotype: [gene_symbol:] [description:]
-```
-
-Example 'cds' header:
-
-```text
->ENSZVIT00000000002.1 cds UG_Zviv_1:LG1:5289:19862:-1 gene:ENSZVIG00000000002.1 gene_biotype:protein_coding transcript_biotype:protein_coding
-```
-
-The headers in the 'pep' FASTA files have the format:
-
-```text
-> :::: gene: transcript: gene_biotype: transcript_biotype: [gene_symbol:] [description:]
-```
-
-Example 'pep' header:
-
-```text
->ENSZVIP00000000002.1 pep UG_Zviv_1:LG1:5289:19862:-1 gene:ENSZVIG00000000002.1 transcript:ENSZVIT00000000002.1 gene_biotype:protein_coding transcript_biotype:protein_coding
-```
-
-Stable IDs for genes, transcripts, and proteins include a version
-suffix. Gene symbols and descriptions are not available for all genes.
-
-#### GFF3 file
-
-A GFF3 ([specification](https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md)) file is also provided.
-GFF3 files are validated using [GenomeTools](http://genometools.org).
-
-The 'type' of gene features is:
-
-- `gene` for protein-coding genes
-- `ncRNA_gene` for RNA genes
-- `pseudogene` for pseudogenes
-
-The 'type' of transcript features is:
-
-- `mRNA` for protein-coding transcripts
-- a specific type or RNA transcript such as `snoRNA` or `lnc_RNA`
-- `pseudogenic_transcript` for pseudogenes
-
-All transcripts are linked to `exon` features.
-Protein-coding transcripts are linked to `CDS`, `five_prime_UTR`, and
-`three_prime_UTR` features.
-
-Attributes for feature types:
-(italics indicate data which is not available for all features)
-
-- region types:
-  - `ID`: Unique identifier, format `:`
-  - _`Alias`_: A comma-separated list of aliases, usually including the
-    `INSDC` accession
-  - _`Is_circular`_: Flag to indicate circular regions
-- gene types:
-  - `ID`: Unique identifier, format `gene:`
-  - `biotype`: Ensembl biotype, e.g. `protein_coding`, `pseudogene`
-  - `gene_id`: Ensembl gene stable ID
-  - `version`: Ensembl gene version
-  - _`Name`_: Gene name
-  - _`description`_: Gene description
-- transcript types:
-  - `ID`: Unique identifier, format `transcript:`
-  - `Parent`: Gene identifier, format `gene:`
-  - `biotype`: Ensembl biotype, e.g. `protein_coding`, `pseudogene`
-  - `transcript_id`: Ensembl transcript stable ID
-  - `version`: Ensembl transcript version
-  - _`Note`_: If the transcript sequence has been edited (i.e. differs
-    from the genomic sequence), the edits are described in a note.
-- exon
-  - `Parent`: Transcript identifier, format `transcript:`
-  - `exon_id`: Ensembl exon stable ID
-  - `version`: Ensembl exon version
-  - `constitutive`: Flag to indicate if exon is present in all
-    transcripts
-  - `rank`: Integer that show the 5'->3' ordering of exons
-- CDS
-  - `ID`: Unique identifier, format `CDS:`
-  - `Parent`: Transcript identifier, format `transcript:`
-  - `protein_id`: Ensembl protein stable ID
-  - `version`: Ensembl protein version
-
 ### Repeat annotation files
 
-Here are the files you can expect in the `repeats/` sub-directory.
+Here are the files you can expect in the results directory.
 
 ```text
-analysis
-└── gfLaeSulp1.1
-    └── repeats
-        └── ncbi
-            ├── GCA_927399515.1.masked.ncbi.bed.gz
-            ├── GCA_927399515.1.masked.ncbi.bed.gz.gzi
-            ├── GCA_927399515.1.masked.ncbi.bed.gz.tbi
-            ├── GCA_927399515.1.masked.ncbi.fasta.dict
-            ├── GCA_927399515.1.masked.ncbi.fasta.gz
-            ├── GCA_927399515.1.masked.ncbi.fasta.gz.fai
-            └── GCA_927399515.1.masked.ncbi.fasta.gz.gzi
+└── repeats
+    └── ensembl
+        ├── GCA_907164925.1.masked.ensembl.bed.gz
+        ├── GCA_907164925.1.masked.ensembl.bed.gz.csi
+        ├── GCA_907164925.1.masked.ensembl.bed.gz.gzi
+        ├── GCA_907164925.1.masked.ensembl.bed.gz.tbi
+        ├── GCA_907164925.1.masked.ensembl.fa.dict
+        ├── GCA_907164925.1.masked.ensembl.fa.gz
+        ├── GCA_907164925.1.masked.ensembl.fa.gz.fai
+        ├── GCA_907164925.1.masked.ensembl.fa.gz.gzi
+        └── GCA_907164925.1.masked.ensembl.fa.gz.sizes
 ```
 
-They all correspond to the repeat-masking analysis run by Ensembl themselves. Like for the `assembly/` sub-directory,
-the directory structure includes the assembly name, e.g. `gfLaeSulp1.1`, and all files are named after the assembly accession, e.g. `GCA_927399515.1`.
+They all correspond to the repeat-masking analysis run by Ensembl themselves.
+All files are named after the assembly accession, e.g. `GCA_907164925.1`.
 
 - `GCA_*.masked.ncbi.fasta.gz`: Masked assembly in Fasta format
 - `GCA_*.masked.ncbi.bed.gz`: BED file with the coordinates of the regions masked by the Ensembl pipeline