Conversation
Hi @EnverOsmanov, thanks for the PR!

```scala
val lastCellNum = r.getLastCellNum
colInd
  .iterator
  .filter(_ < lastCellNum)
```
Benchmarks: here is the code I use to read the data. Btw, I just checked the content of
The alternative approach avoids iterating over the full range. But I'm not exactly sure what the idea behind the change in API V2 was.
Hmm, maybe it is to be able to do `r.getCell(_, MissingCellPolicy.CREATE_NULL_AS_BLANK)`. @quanghgx, could you chime in here?
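For context: POI's `Row.MissingCellPolicy.CREATE_NULL_AS_BLANK` makes `getCell` return a blank cell instead of `null` when a cell is missing, so the caller needs no null checks. A minimal pure-Scala sketch of the two policies, using a hypothetical `SparseRow` stand-in rather than the real POI `Row`:

```scala
sealed trait Cell
case object BlankCell extends Cell
final case class ValueCell(value: String) extends Cell

// Hypothetical sparse row: only some column indices hold a cell.
final class SparseRow(cells: Map[Int, Cell]) {
  // RETURN_NULL-style semantics: a missing cell comes back as null.
  def cellOrNull(i: Int): Cell = cells.getOrElse(i, null)

  // CREATE_NULL_AS_BLANK-style semantics: a missing cell comes back blank.
  def cellOrBlank(i: Int): Cell = cells.getOrElse(i, BlankCell)
}

object PolicyDemo {
  def main(args: Array[String]): Unit = {
    val row = new SparseRow(Map(0 -> ValueCell("a"), 2 -> ValueCell("c")))
    println(row.cellOrNull(1))   // null
    println(row.cellOrBlank(1))  // BlankCell
  }
}
```

With the blank-cell policy, a reader can map over a fixed column range uniformly, without per-cell null handling.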
Force-pushed from 6b58ec4 to 6866cb1
The symptoms:
I have a file with ~1 million rows, 125 columns. It takes ~12 seconds to count lines with spark-excel's API V1 and ~2 minutes with API V2.
The issue:
`Range` does not have its own optimized `filter` method, so it falls back to the one from `TraversableLike`, which iterates over every number in the range. On top of that, `r.getLastCellNum` is evaluated for each number in the range. Here are some rough benchmarks with another file:
- `filter` => 50 seconds
- `val lastCellNum` => 38 seconds
- `withFilter` => 20 seconds
- `takeWhile` => 12 seconds
- API V1 => 12 seconds
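A self-contained sketch of why the hoisted and `takeWhile` variants above are faster, using a hypothetical `CountingRow` stand-in (not POI's `Row`) that counts how often `getLastCellNum` is evaluated:

```scala
// Hypothetical stand-in for a POI Row: counts getLastCellNum evaluations.
class CountingRow(lastCell: Int) {
  var calls = 0
  def getLastCellNum: Int = { calls += 1; lastCell }
}

object FilterDemo {
  def main(args: Array[String]): Unit = {
    val colInd = 0 until 200

    // Original: the predicate re-evaluates getLastCellNum for every index.
    val r1 = new CountingRow(125)
    val kept1 = colInd.filter(_ < r1.getLastCellNum)
    println(s"filter:    ${r1.calls} evaluations") // 200 evaluations

    // Hoisted: evaluate once before filtering.
    val r2 = new CountingRow(125)
    val last = r2.getLastCellNum
    val kept2 = colInd.iterator.filter(_ < last).toVector
    println(s"hoisted:   ${r2.calls} evaluation")  // 1 evaluation

    // takeWhile: stops at the first failing index (column indices are sorted).
    val r3 = new CountingRow(125)
    val kept3 = colInd.takeWhile(_ < r3.getLastCellNum)
    println(s"takeWhile: ${r3.calls} evaluations") // 126 evaluations

    // All three keep the same columns.
    assert(kept1 == kept2 && kept2 == kept3)
  }
}
```

`takeWhile` only works because column indices are ascending, which is why it can match API V1's timing: it never touches indices past the last cell.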
(File taken from here and manually converted to "xlsx")
PS. API V2 seems great! :)