It works fine if the file is about 70 GB.
However, when the file size is about 170 GB, some reads are missing (the missing reads themselves are well-formed).
The missing reads can be found if the file is read line by line.
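One way to establish the expected read count independently of ADAM is a plain line-by-line pass over the (decompressed) file. The sketch below is a hypothetical helper, not part of ADAM; it assumes standard four-line FASTQ records:

```scala
import scala.io.Source

// Count FASTQ records by reading the file line by line.
// A well-formed FASTQ record spans exactly four lines: a '@name' header,
// the sequence, a '+' separator, and the quality string.
def countFastqRecords(path: String): Int = {
  val source = Source.fromFile(path)
  try {
    source.getLines().grouped(4).count { rec =>
      rec.length == 4 && rec.head.startsWith("@") && rec(2).startsWith("+")
    }
  } finally source.close()
}
```

Comparing this count against `sc.loadReads(...).rdd.count()` would confirm how many records the input format drops.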
I have seen issues occasionally with gzipped/bgzf FASTQ input before, although typically with paired reads, where ADAM complains about not having the same numbers of each. If you know of any publicly available datasets that demonstrate this issue, I can dig into it deeper.
As a workaround, you may be able to convert to unaligned BAM format first and then read into ADAM.
Another workaround would be to convert the FASTQ into CSV or tab-delimited format and then use Spark SQL to read the text file and convert it into ADAM format, something like:
```scala
import org.bdgenomics.adam.ds.ADAMContext._

val sql = """SELECT
  _c0 AS name,
  CAST(NULL AS STRING) AS description,
  'DNA' AS alphabet,
  upper(_c1) AS sequence,
  length(_c1) AS length,
  _c2 AS qualityScores,
  CAST(NULL AS STRING) AS sampleId,
  CAST(NULL AS MAP<STRING,STRING>) AS attributes
FROM reads"""

val df = spark.read.option("delimiter", "\t").csv(inputPath)
df.createOrReplaceTempView("reads")
val readsDf = spark.sql(sql)
val reads = sc.loadReads(readsDf)
reads.saveAsParquet(outputPath)
```
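The tab-delimited input that query expects (`_c0` = name, `_c1` = sequence, `_c2` = quality scores) could be produced with a small converter along these lines. This is a hypothetical sketch assuming standard four-line FASTQ records, not an ADAM utility:

```scala
import scala.io.Source
import java.io.PrintWriter

// Flatten four-line FASTQ records into name<TAB>sequence<TAB>qualityScores,
// matching the _c0/_c1/_c2 columns read by the Spark SQL query above.
def fastqToTsv(fastqPath: String, tsvPath: String): Unit = {
  val source = Source.fromFile(fastqPath)
  val out = new PrintWriter(tsvPath)
  try {
    source.getLines().grouped(4).foreach {
      case Seq(name, seq, _, qual) =>
        out.println(s"${name.stripPrefix("@")}\t$seq\t$qual")
      case other =>
        sys.error(s"Truncated FASTQ record: $other")
    }
  } finally {
    source.close()
    out.close()
  }
}
```

For a 170 GB file this would of course be better done as a distributed job, but the record layout is the same.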
adam-core version: 0.33.0
Spark version: 3.3.0
Scala version: 2.12
I read a FASTQ BGZ file with the following code:
Is there any limitation in SingleFastqInputFormat, or is there any advice that could help me debug this issue?