V2 streaming read (alternative approach) #653
Conversation
@christianknoepfle this seems to be working now - have a look if you have time

cool :) I think I can kill my PR ;) (the core idea of the ExcelPartitionReaderFromIterator made it into your approach, but with a cleaner interface, so I am happy ;) ). The only thing I didn't really like is that the CloseableIterator is not really an iterator: it is something that *contains* an iterator, so the name is a bit misleading (unfortunately I have no better idea for now). So we could find another name or, as an alternative, implement it as an iterator. In any case: thanks a lot for your effort. Hope the other folks are fine with your fixes and we see a new version soon (and then hopefully we can use spark-excel in our production code)
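One way out of the naming problem would be to make the wrapper a real `Iterator`. A minimal sketch, assuming only `Iterator` plus `Closeable` semantics are needed (the class name `ClosingIterator` and its signature are hypothetical, not the PR's actual API):

```scala
import java.io.Closeable

// Hypothetical sketch: an Iterator that owns Closeable resources and
// releases them as soon as the wrapped iterator is exhausted.
class ClosingIterator[A](underlying: Iterator[A], resources: Seq[Closeable])
    extends Iterator[A] {
  private var closed = false

  override def hasNext: Boolean = {
    val more = underlying.hasNext
    if (!more && !closed) {
      closed = true
      resources.foreach(_.close()) // e.g. a streaming workbook
    }
    more
  }

  override def next(): A = underlying.next()
}
```

Because it genuinely is an `Iterator`, callers can use `map`, `drop`, `toList` etc. directly, and the name no longer misleads.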
```scala
workbook match {
  case _: StreamingWorkbook => CloseableIterator(rowIter, Seq(workbook))
  case _ => {
    workbook.close()
```
why do you close the workbook here and do not hand it to CloseableIterator()? Would be more consistent to me

this workbook doesn't need to be kept open, but I suppose it could be changed to work like the streaming workbook
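For consistency, both branches could hand the workbook to the closing wrapper and let it decide when to close. A self-contained sketch (the `Workbook` trait and `rowsWithCleanup` helper are stand-ins, not POI's or the PR's actual types):

```scala
import java.io.Closeable

trait Workbook extends Closeable // stand-in for the POI Workbook interface
final class FakeWorkbook extends Workbook {
  var closed = false
  override def close(): Unit = closed = true
}

// Streaming and non-streaming workbooks take the same path: the wrapper
// closes the workbook once the rows have been fully consumed.
def rowsWithCleanup[A](rows: Iterator[A], workbook: Workbook): Iterator[A] =
  new Iterator[A] {
    private var done = false
    override def hasNext: Boolean = {
      val more = rows.hasNext
      if (!more && !done) { done = true; workbook.close() }
      more
    }
    override def next(): A = rows.next()
  }
```

The non-streaming case then loses its special-cased early `workbook.close()` at the cost of keeping the workbook open until the partition is read.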
```scala
val dfExcel = spark.read
  .format("excel")
  .option("path", "src/test/resources/v2readwritetest/large_excel/largefile-wide-single-sheet.xlsx")
  .option("header", value = false)
```
should we test with inferSchema = true and header = true too to get greater coverage? See src/test/scala/com/crealytics/spark/v2/excel/MaxNumRowsSuite.scala in my PR as an idea.

if you like that, you will need the new xlsx too... (it has headers)
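A coverage test along the suggested lines might look like this. This is only a sketch: it assumes a running `SparkSession` named `spark` and the header-carrying xlsx mentioned above, so it is not runnable standalone:

```scala
// Sketch only: needs a SparkSession and the header-carrying test xlsx.
val dfInferred = spark.read
  .format("excel")
  .option("path", "src/test/resources/v2readwritetest/large_excel/largefile-wide-single-sheet.xlsx")
  .option("header", value = true)
  .option("inferSchema", value = true)
  .option("maxRowsInMemory", "200")
  .load()

// With header = true the first row becomes the column names, so the row
// count should be exactly one less than the header = false read.
assert(dfInferred.columns.nonEmpty)
```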
I agree that the name isn't great. I was thinking of renaming it to

@christianknoepfle I made the changes you asked for and copied your test changes. One issue is I get dataframe sizes that are one less than yours - for each of the 3 sub-tests. I haven't investigated yet whether my code strips a row that shouldn't be stripped or if your PR overcounts.
Resolved review threads (outdated):
src/main/2.4/scala/com/crealytics/spark/v2/excel/ExcelDataSource.scala
src/main/3.0_3.1/scala/com/crealytics/spark/v2/excel/ExcelTable.scala
```scala
val numberOfRowToIgnore = if (options.header) (options.ignoreAfterHeader + 1) else 0
paths.tail.foreach(path => {
  val newRows = excelHelper.getSheetData(conf, path)
  sheetData = SheetData(
```
There's a lot of copied code here. I wonder at what point we DRY this and put the code that is shared for the different Spark versions into shared code.
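One shape the shared code could take, sketched with partly hypothetical names (`SheetData` mirrors the diff; `appendSheets` is invented): a version-agnostic helper that each Spark-version-specific module calls.

```scala
import java.io.Closeable

// Simplified stand-in for the SheetData in the diff.
final case class SheetData[A](rows: Iterator[A], closeables: Seq[Closeable] = Nil)

// Shared helper: append the rows of every further file, skipping the
// leading rows of each extra file (headerRows = ignoreAfterHeader + 1
// when options.header is set, 0 otherwise -- as in the diff).
def appendSheets[A](first: SheetData[A], rest: Seq[SheetData[A]], headerRows: Int): SheetData[A] =
  rest.foldLeft(first) { (acc, next) =>
    SheetData(acc.rows ++ next.rows.drop(headerRows), acc.closeables ++ next.closeables)
  }
```

Because `Iterator.++` and `drop` are lazy, nothing is read until the combined rows are consumed, which matches the streaming design.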
```scala
val r = excelHelper.getColumnNames(headerRow)
rows = Iterator(headerRow) ++ rows
r
try {
```
Same here. I had not seen that there is so much overlap in the code, but we should get rid of it.
I've done some refactoring to reduce code duplication - probably more can be done
```scala
.option("path", "src/test/resources/v2readwritetest/large_excel/largefile-wide-single-sheet.xlsx")
.option("header", value = false)
// .option("dataAddress", "'Sheet1'!B7:M16")
.option("maxRowsInMemory", "200")
```
As mentioned in the other PR, the integration tests actually cover streaming reads:
https://github.com/crealytics/spark-excel/blob/main/src/test/scala/com/crealytics/spark/excel/IntegrationSuite.scala#L349
If there is anything missing there, I would prefer to extend them instead of adding new tests with large binary files.
I have been unable to get any of the existing tests to reproduce the issue - the datasets are too small
Overall this Iterator stuff is dangerous territory. It's bitten me more than once, because at some point some code was reading from an Iterator from which another Iterator was derived...

For me, test coverage is the main way to manage this. The code already relies on reading off headers and things like that, so I'm not sure how much can be done to ensure the iterators are not consumed at the wrong time. I covered some of the topics you asked me to look at, but there is more I can do. I haven't yet looked at a way to test the changes without the new file. I'll get back to that later.
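The hazard, and the re-prepending mitigation seen in the diff, can be shown in a few lines of plain Scala (the row data is made up):

```scala
val rows = Iterator("h1,h2", "a,b", "c,d")

// Reading the header consumes one element of the ONE underlying iterator;
// `rows` is now shorter, and so is anything later derived from it.
val headerRow = rows.next()

// Mitigation as in the diff: re-prepend the header so downstream
// consumers see the complete stream again.
val restored = Iterator(headerRow) ++ rows
```

Any code path that forgets the re-prepend, or reads from `rows` after `restored` has started being consumed, silently drops or duplicates rows, which is exactly the class of bug streaming tests should guard against.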
you will need the new excel. I just added a header row to it. Download from here:

thanks - I've updated my PR to use your latest xlsx file
force-pushed from 240d8d7 to fbb6a1a
Hey @pjfanning, I saw you already picked up my change migrating from spark-testing-base to spark-fast-tests 👍 👍 I created a branch which applies the integration tests to v2 as well.

Is there anything holding up this PR? I can say that it finally fixed one of our production issues and is key to getting V2 stable. If you need help, please let me know
If possible, I'd like to expose the bug without adding a huge test file. WDYT?
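One way to expose streaming bugs without committing a large binary resource is to generate the workbook inside the test. A sketch using POI's `SXSSFWorkbook` (dimensions, path, and sheet name are made up; needs poi-ooxml on the classpath, so it is not runnable standalone here):

```scala
import org.apache.poi.xssf.streaming.SXSSFWorkbook
import java.io.FileOutputStream

// Generate a wide, many-row xlsx on the fly instead of committing a
// large binary test resource. Needs poi-ooxml on the classpath.
val wb = new SXSSFWorkbook(100) // keep only 100 rows in memory while writing
val sheet = wb.createSheet("Sheet1")
for (r <- 0 until 20000) {
  val row = sheet.createRow(r)
  for (c <- 0 until 50) row.createCell(c).setCellValue(s"r${r}c$c")
}
val out = new FileOutputStream("target/generated-large.xlsx")
try wb.write(out) finally { out.close(); wb.dispose(); wb.close() }
```

The generated file can then be read back with `maxRowsInMemory` set, which should exercise the same streaming path as the large checked-in xlsx.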
Co-authored-by: Martin Mauch <martin.mauch@gmail.com>
force-pushed from ad77586 to 7f7f0dc
I've removed the large xlsx and the tests that use it

@pjfanning thanks a lot for your efforts and continued support!!
alternative to #651