Author: Anita Brzoza
- Introduction
- Data downloading
- Quality control and filtering
- Mapping to the reference genome
- Detection, filtration and annotation of SNPs
- Comparative analysis of samples
- References
Saccharomyces cerevisiae was the first eukaryote whose genome was sequenced in 1996. Moreover, because of its small genome, it has become a model organism for studying gene function, and a comprehensive knowledge of yeast genetics and biochemistry is important for tracking the molecular processes that cause functional and genetic differences among strains of these microorganisms. The phenotypic variation is defined by many differernt types of variants, starting from single nucleotide polymorphisms (SNPs), insertions or deletions, etc. to large structural changes such as gene copy number variations (CNVs) and chromosomal translocations (1).
In this study, the analysis will focus on the detection of single nucleotide polymorphisms in vac6 (wild type and mutant with impaired vacuole inheritance) and vac22 (wild type) Saccharomyces cerevisiae strains. The data used are from the SRA study number SRP003355 published on 2010-10-12 in the National Center for Biotechnology Information Sequence Read Archive (2).
The first step was to download raw paired-end sequences assigned to SRA study number SRP003355. To do this, the identification numbers of the data files containing the raw sequences were retrieved by creating accession lists using the functions available in Entrez Direct (using commands such as: 1) esearch - to find the SRA study, 2) efetch - to display the resulting data in xml format, 3) xtract - to extract the run numbers). Then the first 1,000,000 sequences from each run were downloaded using fastq-dump (version 2.11.0) from SRA tools and the reads were separated into separate files.
The code used for this step of the analysis can be found in the src directory as data_downloading.sh.
The table below shows which organism is included in the sample.
Run ID | S. cerevisiae strain |
---|---|
SRR064545 | vac6 wild type |
SRR064546 | vac6 mutant |
SRR064547 | vac22 wild type |
In the next step, quality control of the extracted sequences was done using FastQC (version 0.11.9). A summary report of the obtained results was also prepared using MultiQC (version 1.12), which in data directory as multiqc-report1.html.
The sequences obtained were from Illumina 1.9 sequencing, with read lengths of 36 bp (the exception being reverse reads of the wild-type S. cerevisiae vac22 strain, which were 42 bp long). The percentage of GC pairs in the S. cerevisiae vac6 strain was 39%, while in the vac22 strain it was 40-41%. The lowest mean sequence quality per base was about 22 and the highest about 32. Reads for the vac6 strain showed lower mean read quality than for the vac22 strain. The chart of average quality of readings per base is shown below.
The highest sequence quality recorded was 34 (for 4 reads from file SRR064547_1) and the lowest was 0 (for 60 reads from file SRR064547_2 and 1,357 reads from file SRR064546_2).
Quality control also noted the presence of Illumina Universal Adaptors, as shown in the graph below. The adapters appeared in samples SRR064545_1, SRR064546_2, SRR064547_1, and SRR064547_2.
Given the above information, only the adapter sequences were removed using Trimmomatic (version 0.39) because the average quality of the sequences was satisfactory. A summary quality report after filtering can be found in data directory as multiqc-report2.html.
The code used for this step of the analysis can be found in the src directory as 1) quality_control.sh and 2) filtering.sh.
The reference genome with annotation (version R64) was downloaded from the National Center for Biotechnology Information Genome.
Next, using bowtie2 (version 2.3.5.1) the reference genome index was created and then the sequences from the samples were mapped to the reference genome, creating BAM files at once using SAMTOOLS (version 1.10). A summary of the mapping is shown in the table below.
SRR064545 | SRR064546 | SRR064547 | |
---|---|---|---|
Reads | 996558 | 992531 | 999937 |
Paired reads | 996558 (100%) | 992531 (100%) | 999937 (100%) |
Reads aligned concordantly 0 times | 317985 (31.91%) | 619071 (62.37%) | 988404 (98.85%) |
Reads aligned concordantly exactly 1 time | 584576 (58.66%) | 305931 (30.82%) | 9230 (0.92%) |
Reads aligned concordantly >1 times | 93997 (9.43%) | 67529 (6.80%) | 2303 (0.23%) |
Pairs aligned concordantly 0 times | 317985 | 619071 | 988404 |
Pairs aligned discordantly 1 time | 160537 (50.49%) | 318043 (51.37%) | 558299 (56.48%) |
Pairs aligned 0 times concordantly or discordantly | 157448 | 301028 | 430105 |
Mates make up the pairs | 314896 | 602056 | 860210 |
Mates make up the pairs aligned 0 times | 126907 (40.30%) | 201062 (33.40%) | 318600 (37.04%) |
Mates make up the pairs aligned exactly 1 time | 97826 (31.07%) | 174937 (29.06%) | 234035 (27.21%) |
Mates make up the pairs aligned >1 times | 90163 (17.63%) | 226057 (37.55%) | 307575 (35.76%) |
Overall alignment rate | 93.63% | 89.87% | 84.07% |
In the next step, bam files were improved. To do it, bam files were sorted using SAMTOOLS, reads were added to individual read-groups, and duplicates were marked using Picard (version 2.27.1, available from The Genome Analysis Toolkit (GATK) version 4.2.6.1), then bam files were indexed and read depth was calculated for each bam file using SAMTOOLS. The output files can be found in data directory as SRR064545_coverage.txt, SRR064546_coverage.txt and SRR064547_coverage.txt.
The code used for this step of the analysis can be found in the src directory as 1) downloading_reference_genome.sh, 2) mapping_to_reference.sh, 3) improving_bam_files.sh.
First, a reference genome file was prepared by creating the FASTA sequence dictionary file using GATK and creating fasta index file using SAMTOOLS.
Then using GATK polymorphisms were detected, VCF files were merged into one and genotyped. Only single nucleotide polymorphisms were selected. The output CVF file can be found in data directory as SNP.vcf. 6311 SNPs Variants were detected. The variants were then filtered for missing data, indels, depth, quality values using VCFtools (version 0.1.16). 58 variants located on chromosome 12 and mitochondial genome were obtained. Deleted SNPs derived from the mitochondrial genome and there are 8 variants left. The output CVF file can be found in data directory as NC_001144.5.recode.vcf.
The annotation was made using [VariantAnnotation] (version 1.40.0) and [GenomicFeatures] (version 1.46.1) for R (version 4.1.3).
The code used for this step of the analysis can be found in the src directory as 1) preparing_reference_genome.sh, 2) detecting_snp.sh, 3) filtering_SNP.sh and annotation_and_comparative_analysis.R.
The detected genotypes are shown in the table below. As can be seen, the detected SNP variants were reference for the S. cerevisiae vac 22 wild-type strain, in contrast to the vac 6 strain in which they occurred as heterozygotes.
SNP | SRR064545 | SRR064546 | SRR064547 |
---|---|---|---|
NC_001144.5:460070_A/C | 0/1 | 0/1 | 0/0 |
NC_001144.5:460103_T/C | 0/1 | 0/1 | 0/0 |
NC_001144.5:460152_T/G | 0|1 | 0|1 | 0/0 |
NC_001144.5:460162_T/C | 0|1 | 0|1 | 0/0 |
NC_001144.5:460248_T/C | 0|1 | 0/1 | 0/0 |
NC_001144.5:460329_T/C | 0/1 | 0/1 | 0/0 |
NC_001144.5:460549_C/A | 0/1 | 0/1 | 0/0 |
NC_001144.5:460640_T/C | 0/1 | 0/1 | 0/0 |
Variants are located in the promoter region of the transcript and in the intergenic region. As shown in the table below, most of the annotated options are located in the promoter region. Promoter regions involved regions of genes such as RDN37-1 (this is a 35S ribosomal RNA (35S rRNA) transcript. It is encoded by the RDN1 locus and processed into 25S, 18S and 5.8S rRNA (represented by the RDN25, RDN18 and RDN58 loci)) and ETS1-1 (A non-coding region located immediately upstream of RDN18. This region is transcribed as part of the 35S rRNA precursor transcript and contains the essential U3 snoRNA binding site required for 18S rRNA maturation).
Chromosome | Position | Genome region | Gene ID |
---|---|---|---|
NC_001144.5 | 460070 | promoter | RDN37-1 |
NC_001144.5 | 460070 | promoter | ETS1-1 |
NC_001144.5 | 460070 | intergenic | - |
NC_001144.5 | 460103 | promoter | RDN37-1 |
NC_001144.5 | 460103 | promoter | ETS1-1 |
NC_001144.5 | 460103 | intergenic | - |
NC_001144.5 | 460152 | promoter | RDN37-1 |
NC_001144.5 | 460152 | promoter | ETS1-1 |
NC_001144.5 | 460152 | intergenic | - |
NC_001144.5 | 460162 | promoter | RDN37-1 |
NC_001144.5 | 460162 | promoter | ETS1-1 |
NC_001144.5 | 460162 | intergenic | - |
NC_001144.5 | 460248 | promoter | RDN37-1 |
NC_001144.5 | 460248 | promoter | ETS1-1 |
NC_001144.5 | 460248 | intergenic | - |
NC_001144.5 | 460329 | promoter | RDN37-1 |
NC_001144.5 | 460329 | promoter | ETS1-1 |
NC_001144.5 | 460329 | intergenic | - |
NC_001144.5 | 460549 | intergenic | - |
NC_001144.5 | 460640 | intergenic | - |
The code used for this step of the analysis can be found in the src directory as annotation_and_comparative_analysis.R.