V2 streaming read (alternative approach) #653
Conversation
@christianknoepfle this seems to be working now - have a look if you have time

cool :) I think I can kill my PR ;) (the core idea of the ExcelPartitionReaderFromIterator made it into your approach, but with a cleaner interface, so I am happy ;) ). The only thing I didn't really like is that the CloseableIterator is not really an iterator: it is something that *contains* an iterator, so the name is a bit misleading (unfortunately I have no better idea for now). So we could find another name or, as an alternative, implement it as an iterator. In any case: thanks a lot for your effort. Hope the other folks are fine with your fixes and we see a new version soon (and then hopefully we can use spark-excel in our production code)
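One way out of the naming problem would be to make the wrapper a real `Iterator`. A minimal sketch, assuming only `Iterator` plus `Closeable` semantics are needed (the class name `ClosingIterator` and its signature are hypothetical, not the PR's actual API):

```scala
import java.io.Closeable

// Hypothetical sketch: an Iterator that owns Closeable resources and
// releases them as soon as the wrapped iterator is exhausted.
class ClosingIterator[A](underlying: Iterator[A], resources: Seq[Closeable])
    extends Iterator[A] {
  private var closed = false

  override def hasNext: Boolean = {
    val more = underlying.hasNext
    if (!more && !closed) {
      closed = true
      resources.foreach(_.close()) // e.g. a streaming workbook
    }
    more
  }

  override def next(): A = underlying.next()
}
```

Because it genuinely is an `Iterator`, callers can use `map`, `drop`, `toList` etc. directly, and the name no longer misleads.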
```scala
workbook match {
  case _: StreamingWorkbook => CloseableIterator(rowIter, Seq(workbook))
  case _ => {
    workbook.close()
```
why do you close the workbook here and do not hand it to CloseableIterator()? Would be more consistent to me

this workbook doesn't need to be kept open, but I suppose it could be changed to work like the streaming workbook
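For consistency, both branches could hand the workbook to the closing wrapper and let it decide when to close. A self-contained sketch (the `Workbook` trait and `rowsWithCleanup` helper are stand-ins, not POI's or the PR's actual types):

```scala
import java.io.Closeable

trait Workbook extends Closeable // stand-in for the POI Workbook interface
final class FakeWorkbook extends Workbook {
  var closed = false
  override def close(): Unit = closed = true
}

// Streaming and non-streaming workbooks take the same path: the wrapper
// closes the workbook once the rows have been fully consumed.
def rowsWithCleanup[A](rows: Iterator[A], workbook: Workbook): Iterator[A] =
  new Iterator[A] {
    private var done = false
    override def hasNext: Boolean = {
      val more = rows.hasNext
      if (!more && !done) { done = true; workbook.close() }
      more
    }
    override def next(): A = rows.next()
  }
```

The non-streaming case then loses its special-cased early `workbook.close()` at the cost of keeping the workbook open until the partition is read.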
```scala
val dfExcel = spark.read
  .format("excel")
  .option("path", "src/test/resources/v2readwritetest/large_excel/largefile-wide-single-sheet.xlsx")
  .option("header", value = false)
```
should we test with inferSchema = true and header = true too to get greater coverage? See src/test/scala/com/crealytics/spark/v2/excel/MaxNumRowsSuite.scala in my PR as an idea.

if you like that, you will need the new xlsx too... (it has headers)
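A coverage test along the suggested lines might look like this. This is only a sketch: it assumes a running `SparkSession` named `spark` and the header-carrying xlsx mentioned above, so it is not runnable standalone:

```scala
// Sketch only: needs a SparkSession and the header-carrying test xlsx.
val dfInferred = spark.read
  .format("excel")
  .option("path", "src/test/resources/v2readwritetest/large_excel/largefile-wide-single-sheet.xlsx")
  .option("header", value = true)
  .option("inferSchema", value = true)
  .option("maxRowsInMemory", "200")
  .load()

// With header = true the first row becomes the column names, so the row
// count should be exactly one less than the header = false read.
assert(dfInferred.columns.nonEmpty)
```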
I agree that the name isn't great. I was thinking of renaming it to

@christianknoepfle I made the changes you asked for and copied your test changes. One issue is I get dataframe sizes that are one less than yours - for each of the 3 sub-tests. I haven't investigated yet whether my code strips a row that shouldn't be stripped or if your PR overcounts.
Resolved review threads (outdated):
src/main/2.4/scala/com/crealytics/spark/v2/excel/ExcelDataSource.scala
src/main/3.0_3.1/scala/com/crealytics/spark/v2/excel/ExcelTable.scala
```scala
val numberOfRowToIgnore = if (options.header) (options.ignoreAfterHeader + 1) else 0
paths.tail.foreach(path => {
  val newRows = excelHelper.getSheetData(conf, path)
  sheetData = SheetData(
```
There's a lot of copied code here. I wonder at what point we DRY this and put the code that is shared for the different Spark versions into shared code.
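One shape the shared code could take, sketched with partly hypothetical names (`SheetData` mirrors the diff; `appendSheets` is invented): a version-agnostic helper that each Spark-version-specific module calls.

```scala
import java.io.Closeable

// Simplified stand-in for the SheetData in the diff.
final case class SheetData[A](rows: Iterator[A], closeables: Seq[Closeable] = Nil)

// Shared helper: append the rows of every further file, skipping the
// leading rows of each extra file (headerRows = ignoreAfterHeader + 1
// when options.header is set, 0 otherwise -- as in the diff).
def appendSheets[A](first: SheetData[A], rest: Seq[SheetData[A]], headerRows: Int): SheetData[A] =
  rest.foldLeft(first) { (acc, next) =>
    SheetData(acc.rows ++ next.rows.drop(headerRows), acc.closeables ++ next.closeables)
  }
```

Because `Iterator.++` and `drop` are lazy, nothing is read until the combined rows are consumed, which matches the streaming design.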
```scala
val r = excelHelper.getColumnNames(headerRow)
rows = Iterator(headerRow) ++ rows
r
try {
```
Same here. I had not seen that there is so much overlap in the code, but we should get rid of it.
I've done some refactoring to reduce code duplication - probably more can be done
```scala
.option("path", "src/test/resources/v2readwritetest/large_excel/largefile-wide-single-sheet.xlsx")
.option("header", value = false)
// .option("dataAddress", "'Sheet1'!B7:M16")
.option("maxRowsInMemory", "200")
```
As mentioned in the other PR, the integration tests actually cover streaming reads:
https://github.com/crealytics/spark-excel/blob/main/src/test/scala/com/crealytics/spark/excel/IntegrationSuite.scala#L349
If there is anything missing there, I would prefer to extend them instead of adding new tests with large binary files.
I have been unable to get any of the existing tests to reproduce the issue - the datasets are too small
Overall this Iterator stuff is dangerous territory. It's bitten me more than once, because at some point some code was reading from an Iterator from which another Iterator was derived...

For me, test coverage is the main way to manage this. The code already relies on reading off headers and things like that, so I'm not sure how much can be done to ensure the iterators are not consumed at the wrong time. I covered some of the topics you asked me to look at, but there is more I can do. I haven't yet looked at a way to test the changes without the new file. I'll get back to that later.
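The hazard, and the re-prepending mitigation seen in the diff, can be shown in a few lines of plain Scala (the row data is made up):

```scala
val rows = Iterator("h1,h2", "a,b", "c,d")

// Reading the header consumes one element of the ONE underlying iterator;
// `rows` is now shorter, and so is anything later derived from it.
val headerRow = rows.next()

// Mitigation as in the diff: re-prepend the header so downstream
// consumers see the complete stream again.
val restored = Iterator(headerRow) ++ rows
```

Any code path that forgets the re-prepend, or reads from `rows` after `restored` has started being consumed, silently drops or duplicates rows, which is exactly the class of bug streaming tests should guard against.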
you will need the new excel. I just added a header row to it. Download from here:

thanks - I've updated my PR to use your latest xlsx file
force-pushed from 240d8d7 to fbb6a1a
Hey @pjfanning, I saw you already picked up my change migrating from spark-testing-base to spark-fast-tests 👍 👍 I created a branch which applies the integration tests to v2 as well.

Is there anything holding up this PR? I can say that it finally fixed one of our production issues and is key to getting V2 stable. If you need help, please let me know
If possible, I'd like to expose the bug without adding a huge test file. WDYT?
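One way to expose streaming bugs without committing a large binary resource is to generate the workbook inside the test. A sketch using POI's `SXSSFWorkbook` (dimensions, path, and sheet name are made up; needs poi-ooxml on the classpath, so it is not runnable standalone here):

```scala
import org.apache.poi.xssf.streaming.SXSSFWorkbook
import java.io.FileOutputStream

// Generate a wide, many-row xlsx on the fly instead of committing a
// large binary test resource. Needs poi-ooxml on the classpath.
val wb = new SXSSFWorkbook(100) // keep only 100 rows in memory while writing
val sheet = wb.createSheet("Sheet1")
for (r <- 0 until 20000) {
  val row = sheet.createRow(r)
  for (c <- 0 until 50) row.createCell(c).setCellValue(s"r${r}c$c")
}
val out = new FileOutputStream("target/generated-large.xlsx")
try wb.write(out) finally { out.close(); wb.dispose(); wb.close() }
```

The generated file can then be read back with `maxRowsInMemory` set, which should exercise the same streaming path as the large checked-in xlsx.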
Co-authored-by: Martin Mauch <martin.mauch@gmail.com>
force-pushed from ad77586 to 7f7f0dc
I've removed the large xlsx and the tests that use it

@pjfanning thanks a lot for your efforts and continued support!!
alternative to #651