Detection of single nucleotide polymorphisms in Saccharomyces cerevisiae

Author: Anita Brzoza

Introduction

Saccharomyces cerevisiae was the first eukaryote whose genome was sequenced in 1996. Moreover, because of its small genome, it has become a model organism for studying gene function, and a comprehensive knowledge of yeast genetics and biochemistry is important for tracking the molecular processes that cause functional and genetic differences among strains of these microorganisms. The phenotypic variation is defined by many differernt types of variants, starting from single nucleotide polymorphisms (SNPs), insertions or deletions, etc. to large structural changes such as gene copy number variations (CNVs) and chromosomal translocations (1).

In this study, the analysis will focus on the detection of single nucleotide polymorphisms in vac6 (wild type and mutant with impaired vacuole inheritance) and vac22 (wild type) Saccharomyces cerevisiae strains. The data used are from the SRA study number SRP003355 published on 2010-10-12 in the National Center for Biotechnology Information Sequence Read Archive (2).

Back to table of contents.

Data downloading

The first step was to download raw paired-end sequences assigned to SRA study number SRP003355. To do this, the identification numbers of the data files containing the raw sequences were retrieved by creating accession lists using the functions available in Entrez Direct (using commands such as: 1) esearch - to find the SRA study, 2) efetch - to display the resulting data in xml format, 3) xtract - to extract the run numbers). Then the first 1,000,000 sequences from each run were downloaded using fastq-dump (version 2.11.0) from SRA tools and the reads were separated into separate files.

The code used for this step of the analysis can be found in the src directory as data_downloading.sh.

The table below shows which organism is included in the sample.

Run ID	S. cerevisiae strain
SRR064545	vac6 wild type
SRR064546	vac6 mutant
SRR064547	vac22 wild type

Back to table of contents.

Quality control and filtering

In the next step, quality control of the extracted sequences was done using FastQC (version 0.11.9). A summary report of the obtained results was also prepared using MultiQC (version 1.12), which in data directory as multiqc-report1.html.

The sequences obtained were from Illumina 1.9 sequencing, with read lengths of 36 bp (the exception being reverse reads of the wild-type S. cerevisiae vac22 strain, which were 42 bp long). The percentage of GC pairs in the S. cerevisiae vac6 strain was 39%, while in the vac22 strain it was 40-41%. The lowest mean sequence quality per base was about 22 and the highest about 32. Reads for the vac6 strain showed lower mean read quality than for the vac22 strain. The chart of average quality of readings per base is shown below.

The highest sequence quality recorded was 34 (for 4 reads from file SRR064547_1) and the lowest was 0 (for 60 reads from file SRR064547_2 and 1,357 reads from file SRR064546_2).

Quality control also noted the presence of Illumina Universal Adaptors, as shown in the graph below. The adapters appeared in samples SRR064545_1, SRR064546_2, SRR064547_1, and SRR064547_2.

Given the above information, only the adapter sequences were removed using Trimmomatic (version 0.39) because the average quality of the sequences was satisfactory. A summary quality report after filtering can be found in data directory as multiqc-report2.html.

The code used for this step of the analysis can be found in the src directory as 1) quality_control.sh and 2) filtering.sh.

Back to table of contents.

Mapping to the reference genome

The reference genome with annotation (version R64) was downloaded from the National Center for Biotechnology Information Genome.

Next, using bowtie2 (version 2.3.5.1) the reference genome index was created and then the sequences from the samples were mapped to the reference genome, creating BAM files at once using SAMTOOLS (version 1.10). A summary of the mapping is shown in the table below.

	SRR064545	SRR064546	SRR064547
Reads	996558	992531	999937
Paired reads	996558 (100%)	992531 (100%)	999937 (100%)
Reads aligned concordantly 0 times	317985 (31.91%)	619071 (62.37%)	988404 (98.85%)
Reads aligned concordantly exactly 1 time	584576 (58.66%)	305931 (30.82%)	9230 (0.92%)
Reads aligned concordantly >1 times	93997 (9.43%)	67529 (6.80%)	2303 (0.23%)

Pairs aligned concordantly 0 times	317985	619071	988404
Pairs aligned discordantly 1 time	160537 (50.49%)	318043 (51.37%)	558299 (56.48%)

Pairs aligned 0 times concordantly or discordantly	157448	301028	430105
Mates make up the pairs	314896	602056	860210
Mates make up the pairs aligned 0 times	126907 (40.30%)	201062 (33.40%)	318600 (37.04%)
Mates make up the pairs aligned exactly 1 time	97826 (31.07%)	174937 (29.06%)	234035 (27.21%)
Mates make up the pairs aligned >1 times	90163 (17.63%)	226057 (37.55%)	307575 (35.76%)

Overall alignment rate	93.63%	89.87%	84.07%

In the next step, bam files were improved. To do it, bam files were sorted using SAMTOOLS, reads were added to individual read-groups, and duplicates were marked using Picard (version 2.27.1, available from The Genome Analysis Toolkit (GATK) version 4.2.6.1), then bam files were indexed and read depth was calculated for each bam file using SAMTOOLS. The output files can be found in data directory as SRR064545_coverage.txt, SRR064546_coverage.txt and SRR064547_coverage.txt.

The code used for this step of the analysis can be found in the src directory as 1) downloading_reference_genome.sh, 2) mapping_to_reference.sh, 3) improving_bam_files.sh.

Back to table of contents.

Detection, filtration and annotation of SNPs

First, a reference genome file was prepared by creating the FASTA sequence dictionary file using GATK and creating fasta index file using SAMTOOLS.

Then using GATK polymorphisms were detected, VCF files were merged into one and genotyped. Only single nucleotide polymorphisms were selected. The output CVF file can be found in data directory as SNP.vcf. 6311 SNPs Variants were detected. The variants were then filtered for missing data, indels, depth, quality values using VCFtools (version 0.1.16). 58 variants located on chromosome 12 and mitochondial genome were obtained. Deleted SNPs derived from the mitochondrial genome and there are 8 variants left. The output CVF file can be found in data directory as NC_001144.5.recode.vcf.

The annotation was made using [VariantAnnotation] (version 1.40.0) and [GenomicFeatures] (version 1.46.1) for R (version 4.1.3).

The code used for this step of the analysis can be found in the src directory as 1) preparing_reference_genome.sh, 2) detecting_snp.sh, 3) filtering_SNP.sh and annotation_and_comparative_analysis.R.

Back to table of contents.

Comparative analysis of samples

The detected genotypes are shown in the table below. As can be seen, the detected SNP variants were reference for the S. cerevisiae vac 22 wild-type strain, in contrast to the vac 6 strain in which they occurred as heterozygotes.

SNP	SRR064545	SRR064546	SRR064547
NC_001144.5:460070_A/C	0/1	0/1	0/0
NC_001144.5:460103_T/C	0/1	0/1	0/0
NC_001144.5:460152_T/G	0\|1	0\|1	0/0
NC_001144.5:460162_T/C	0\|1	0\|1	0/0
NC_001144.5:460248_T/C	0\|1	0/1	0/0
NC_001144.5:460329_T/C	0/1	0/1	0/0
NC_001144.5:460549_C/A	0/1	0/1	0/0
NC_001144.5:460640_T/C	0/1	0/1	0/0

Variants are located in the promoter region of the transcript and in the intergenic region. As shown in the table below, most of the annotated options are located in the promoter region. Promoter regions involved regions of genes such as RDN37-1 (this is a 35S ribosomal RNA (35S rRNA) transcript. It is encoded by the RDN1 locus and processed into 25S, 18S and 5.8S rRNA (represented by the RDN25, RDN18 and RDN58 loci)) and ETS1-1 (A non-coding region located immediately upstream of RDN18. This region is transcribed as part of the 35S rRNA precursor transcript and contains the essential U3 snoRNA binding site required for 18S rRNA maturation).

Chromosome	Position	Genome region	Gene ID
NC_001144.5	460070	promoter	RDN37-1
NC_001144.5	460070	promoter	ETS1-1
NC_001144.5	460070	intergenic	-
NC_001144.5	460103	promoter	RDN37-1
NC_001144.5	460103	promoter	ETS1-1
NC_001144.5	460103	intergenic	-
NC_001144.5	460152	promoter	RDN37-1
NC_001144.5	460152	promoter	ETS1-1
NC_001144.5	460152	intergenic	-
NC_001144.5	460162	promoter	RDN37-1
NC_001144.5	460162	promoter	ETS1-1
NC_001144.5	460162	intergenic	-
NC_001144.5	460248	promoter	RDN37-1
NC_001144.5	460248	promoter	ETS1-1
NC_001144.5	460248	intergenic	-
NC_001144.5	460329	promoter	RDN37-1
NC_001144.5	460329	promoter	ETS1-1
NC_001144.5	460329	intergenic	-
NC_001144.5	460549	intergenic	-
NC_001144.5	460640	intergenic	-

The code used for this step of the analysis can be found in the src directory as annotation_and_comparative_analysis.R.

Back to table of contents.

References

[1] Basile, A., De Pascale, F., Bianca, F., Rossi, A., Frizzarin, M., De Bernardini, N., … Campanaro, S. (2021). Large-scale sequencing and comparative analysis of oenological Saccharomyces cerevisiae strains supported by nanopore refinement of key genomes. Food Microbiology, 97, 103753. doi:10.1016/j.fm.2021.103753

[2] Birkeland, S. R., Jin, N., Ozdemir, A. C., Lyons, R. H., Jr, Weisman, L. S., Wilson, T. E. (2010). Discovery of mutations in Saccharomyces cerevisiae by pooled linkage analysis and whole-genome sequencing. Genetics, 186(4), 1127–1137. doi.org/10.1534/genetics.110.123232

Back to table of contents.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

REPORT.md

REPORT.md

Detection of single nucleotide polymorphisms in Saccharomyces cerevisiae

Table of contents

Introduction

Data downloading

Quality control and filtering

Mapping to the reference genome

Detection, filtration and annotation of SNPs

Comparative analysis of samples

References

[2] Birkeland, S. R., Jin, N., Ozdemir, A. C., Lyons, R. H., Jr, Weisman, L. S., Wilson, T. E. (2010). Discovery of mutations in Saccharomyces cerevisiae by pooled linkage analysis and whole-genome sequencing. Genetics, 186(4), 1127–1137. doi.org/10.1534/genetics.110.123232

Files

REPORT.md

Latest commit

History

REPORT.md

File metadata and controls

Detection of single nucleotide polymorphisms in Saccharomyces cerevisiae

Table of contents

Introduction

Data downloading

Quality control and filtering

Mapping to the reference genome

Detection, filtration and annotation of SNPs

Comparative analysis of samples

References

[2] Birkeland, S. R., Jin, N., Ozdemir, A. C., Lyons, R. H., Jr, Weisman, L. S., Wilson, T. E. (2010). Discovery of mutations in Saccharomyces cerevisiae by pooled linkage analysis and whole-genome sequencing. Genetics, 186(4), 1127–1137. doi.org/10.1534/genetics.110.123232