Small String Optimization #362
I am not certain of the impact it could have, but strings are indeed mostly small and immutable, so it seems like a good fit.
Thanks a lot for looking at it!
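For readers unfamiliar with the technique, the small-string idea can be sketched as a minimal tagged union. This is an illustration only; the `SsoString` name and the 22-byte inline capacity are made up for the example and do not reflect smartstring's actual (more compact) layout:

```rust
// Minimal SSO sketch: strings up to INLINE_CAP bytes are stored inline
// in the value itself; longer ones fall back to a heap-allocated String.
const INLINE_CAP: usize = 22;

enum SsoString {
    Inline { len: u8, buf: [u8; INLINE_CAP] },
    Heap(String),
}

impl SsoString {
    fn new(s: &str) -> Self {
        if s.len() <= INLINE_CAP {
            let mut buf = [0u8; INLINE_CAP];
            buf[..s.len()].copy_from_slice(s.as_bytes());
            SsoString::Inline { len: s.len() as u8, buf }
        } else {
            SsoString::Heap(s.to_owned())
        }
    }

    fn as_str(&self) -> &str {
        match self {
            // Safe because we only ever copy whole &str contents in.
            SsoString::Inline { len, buf } => {
                std::str::from_utf8(&buf[..*len as usize]).unwrap()
            }
            SsoString::Heap(s) => s,
        }
    }
}

fn main() {
    let short = SsoString::new("cell A1"); // stays inline, no allocation
    let long = SsoString::new("a string that is definitely longer than twenty-two bytes");
    assert_eq!(short.as_str(), "cell A1");
    assert!(matches!(short, SsoString::Inline { .. }));
    assert!(matches!(long, SsoString::Heap(_)));
    println!("ok");
}
```

Real SSO types pack the discriminant and length into spare bits to avoid growing the value past 24 bytes, which is the part this sketch deliberately skips.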
With the current way the library is designed, which directly exposes `DataType`, I can try to hack together a feature flag that would expose the new SSO string directly and use it where possible, but this is going to lead to a bit of a code mess and also further lock us into whichever SSO implementation we go with, since the new type would be part of the public API; it can't even be an implementation detail. How would you feel if I wrote up a proposal about working towards a version 1.0 with some changes to accommodate better ergonomics and better encapsulation? I have looked through some of the issues and can try to address them as I go, as well as at least start some discussion on how writing spreadsheets would work, with the ultimate goal of getting the library ready for a stable 1.0.
I'd be happy to see how it could go, yes, but I'm afraid it may take quite a lot of your time. Having a dedicated public string type is one option; alternatively, an opaque wrapper type would keep the SSO choice an implementation detail.
Cool, I will try to come up with a write-up for the reading API, leaving writing for later, and we can go from there on what should actually be done. As for storing the values, another option would be to store all the values in a single enum. We could also provide accessor methods for the enum variants that return a failure when the wrong variant is accessed. The type info would be useful for writing, so if possible I think it's best to track it. The worst case for performance would need measuring.
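The "accessor with a failure case" idea could look something like the following sketch; the `CellValue` and `TypeError` names are hypothetical, not the crate's actual API:

```rust
// Hypothetical value enum with fallible typed accessors: callers get a
// Result instead of a panic when they ask for the wrong variant.
#[derive(Debug, PartialEq)]
enum CellValue {
    Int(i64),
    Float(f64),
    Str(String),
    Empty,
}

#[derive(Debug, PartialEq)]
struct TypeError;

impl CellValue {
    fn as_f64(&self) -> Result<f64, TypeError> {
        match self {
            CellValue::Float(f) => Ok(*f),
            // Widening an integer cell to f64 is lossless for small values.
            CellValue::Int(i) => Ok(*i as f64),
            _ => Err(TypeError),
        }
    }

    fn as_str(&self) -> Result<&str, TypeError> {
        match self {
            CellValue::Str(s) => Ok(s),
            _ => Err(TypeError),
        }
    }
}

fn main() {
    let v = CellValue::Int(42);
    assert_eq!(v.as_f64(), Ok(42.0));
    assert_eq!(v.as_str(), Err(TypeError));
    println!("ok");
}
```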
Another case for using the SSO type shows up at lines 1015 to 1020 in 09c25ba.
The main issue is when thinking of eventually having an API for writing and editing. There would be a need for the cell to own its value, and we can't really just hand out borrows into data the caller may want to mutate. I think having the SSO type in the cell is the best middle ground; a write API could then be built on top of cells that own their values.
Sorry for the late reply. I may be wrong, but couldn't we use some form of borrowing or copy-on-write here? Don't get me wrong, I am really not opposed to having an SSO type in the cell; I just want to understand all the constraints.
#369, while not tackling the per-cell issue, might help with the memory usage for large shared-strings tables in xlsx (as per your benchmarks). Unfortunately I cannot test it properly (the computer I'm using now is under WSL and I can't figure out how to monitor memory usage); would you mind giving it a try? Similarly, another avenue would be to add some function to read row by row instead of saving everything as a big range and then iterating over it, assuming it is OK to hold all the shared data in memory.
The second spike could be because of some intermediate copy. In general, this is what I'd expect, because the run is overwhelmingly dominated by string allocation.
Did a memory profile with valgrind's massif.
Here is the plot of the profile, and here is a pastebin of the full output. The allocation trees are associated with a snapshot; more info can be found in the massif man page.
Thanks for the analysis. I've done another iteration in #370 to provide some improvements. On my machine it is >10% faster; memory usage is also better, but still not impressive.
10% is a good speedup; it just shows the allocation pressure that exists. When measured, there were 2,819,870 shared strings. Each of those carries a 24-byte `String` header (pointer, length, capacity), for 67,676,880 bytes before counting any character data. But I'm actually wondering now where the extra overhead is coming from to get us to ~2 GB before the sheet is even loaded; 200 MB seems more like it from this back-of-the-envelope math. I wonder if there is a lot of fragmentation going on from each string getting allocated separately?
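The header arithmetic above can be checked directly (assuming a 64-bit target, where a `String` is three machine words):

```rust
use std::mem::size_of;

fn main() {
    // On a 64-bit target, a String header is pointer + length + capacity.
    assert_eq!(size_of::<String>(), 24);

    // 2,819,870 shared strings, one 24-byte header each:
    let shared_strings: u64 = 2_819_870;
    let header_bytes = shared_strings * size_of::<String>() as u64;
    assert_eq!(header_bytes, 67_676_880); // ~64.5 MiB before any character data
    println!("{header_bytes}");
}
```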
Did a quick patch to use `CompactString`. Over the current implementation:
So at this point in the program's execution, I waited for the print of the number of inline strings and then stopped the profile right after, so I could see the memory usage right after the initial struct was made with the shared strings in it. I'm really not sure where the rest of the memory usage is coming from if not from the sheet, given the shared-strings data was in memory this early on. @tafia, do you have any idea what else could cause the memory to go up like this that wouldn't be from the sheet read?
At this point most of the memory is used in cells:
Clearly the SSO alone won't cut it; we should either find a better way to store a cell or use streaming iterators.
Having a streaming iterator, on the other hand, seems doable and would dwarf any other attempt in terms of both memory and speed.
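The streaming shape can be sketched with a toy reader; this is not calamine's API, just an illustration of why memory stays proportional to one record instead of the whole file (the `RecordStream` name and newline-delimited format are invented for the example):

```rust
// Toy streaming reader: yields one record at a time instead of
// materializing every record up front, so peak memory is O(record).
struct RecordStream<'a> {
    data: &'a str,
}

impl<'a> RecordStream<'a> {
    fn next_record(&mut self) -> Option<&'a str> {
        if self.data.is_empty() {
            return None;
        }
        match self.data.find('\n') {
            Some(i) => {
                let (rec, rest) = self.data.split_at(i);
                self.data = &rest[1..]; // skip the newline
                Some(rec)
            }
            None => {
                let rec = self.data;
                self.data = "";
                Some(rec)
            }
        }
    }
}

fn main() {
    let mut stream = RecordStream { data: "a,1\nb,2\nc,3" };
    let mut rows = 0;
    while let Some(rec) = stream.next_record() {
        rows += 1;
        assert!(rec.contains(','));
    }
    assert_eq!(rows, 3);
    println!("{rows}");
}
```

A real cell reader would additionally need the item to borrow from the reader's internal decode buffer (a "lending" iterator), which is why it cannot implement the std `Iterator` trait directly.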
If we commit to being read-only, we could just use borrowed cells:

```rust
struct Worksheet<'a, 'b> {
    shared_strings: &'a [CompactString],
    cells: Vec<Cell<'b>>,
}
```

And the part for the string in the cell would just be a borrow into the shared-strings table:

```rust
match get_attribute(c_element.attributes(), QName(b"t"))? {
    Some(b"s") => {
        // shared string
        let idx: usize = v.parse()?;
        Ok(DataType::String(shared_strings[idx].as_str()))
    }
    // ...
}
```

The issue with mutating the data is that the cell would then need to own its value, so the borrowed design breaks down for writing. Being read-only would also mean that the public API can just expose borrowed strings. Sparse-data optimization could help in certain workloads, but as it really doesn't seem like we can reduce the memory requirements of the owned representation much further, streaming looks more promising. One thing I am curious about is the large spike at the end of the program: if the gradual climb after the shared strings are read into memory is the sheet being read, is cell creation the final big bump? But if that's the case, why is there a climb up to that point with the current implementation?
An option to work around ownership is to ask the caller for a buffer to hold the non-shared strings (like what is done in quick-xml):

```rust
/// Get worksheet range where shared string values are only borrowed
pub fn worksheet_range_ref<'a>(
    &'a mut self,
    name: &str,
    buffer: &'a mut Vec<u8>,
) -> Option<Result<Range<DataTypeRef<'a>>, XlsxError>> {
    // ...
}
```

The buffer could be something other than a plain `Vec<u8>`.
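To make the caller-supplied-buffer pattern concrete, here is a toy version; `read_field` and its trivial "decoding" step are invented for the illustration, not part of the proposed API:

```rust
// Caller-supplied buffer pattern: the caller owns the scratch space and
// the returned &str borrows from it, so repeated reads reuse a single
// allocation instead of allocating a fresh String per field.
fn read_field<'a>(raw: &[u8], buf: &'a mut Vec<u8>) -> &'a str {
    buf.clear();
    buf.extend_from_slice(raw); // a real reader would decode/unescape here
    std::str::from_utf8(buf).unwrap()
}

fn main() {
    let mut buf = Vec::new();
    let records: &[&[u8]] = &[b"alpha", b"beta"];
    for r in records {
        // The same buffer is reborrowed on every iteration.
        let s = read_field(r, &mut buf);
        assert!(!s.is_empty());
    }
    assert_eq!(read_field(b"alpha", &mut buf), "alpha");
    println!("ok");
}
```

The trade-off is the same one quick-xml makes: the returned value's lifetime is tied to the buffer, so only one borrowed field can be alive at a time.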
What I believe we're seeing here is just the cumulative allocation total.
Yes. In general, the current API could simply collect all the cells into a range, so we don't duplicate too much code.
I've updated #370 to support a new `worksheet_cells_reader`:

```rust
let mut cell_reader = excel
    .worksheet_cells_reader("NYC_311_SR_2010-2020-sample-1M")
    .unwrap();
let mut l = 0;
while let Some(cell) = cell_reader.next_cell().unwrap() {
    l += cell.get_position().0;
}
```

This one, as expected, doesn't use much memory (relatively speaking, of course); it is mostly the shared strings.

In terms of perf, there isn't much of a difference compared to reading the full range.
This looks like a promising direction. So then we can have two modes: an aggressive mode that reads fully into memory, and a "lazy" option that exposes iterators of cells and, with some work, rows?

```rust
use calamine::{Workbook, Xlsx, XlsxError};

fn main() -> Result<(), XlsxError> {
    // Type-state for a full open in memory
    let workbook: Xlsx<Memory> = Workbook::open("WORKBOOK.xlsx")?;
    Ok(())
}
```

```rust
use calamine::{Workbook, Xlsx, XlsxError};

fn main() -> Result<(), XlsxError> {
    // Type-state for a lazy open
    let workbook: Xlsx<Lazy> = Workbook::open("WORKBOOK.xlsx")?;
    Ok(())
}
```

Or if we want to scope the changes to only the sheet level:

```rust
use calamine::{Workbook, Xlsx, XlsxError};

fn main() -> Result<(), XlsxError> {
    let workbook: Xlsx = Workbook::open("WORKBOOK.xlsx")?;
    let worksheet: Worksheet<Memory> = workbook.worksheet("Sheet1")?;
    while let Some(cell) = worksheet.cells() {
        // work
    }
    while let Some(row) = worksheet.rows() {
        // work
    }
    Ok(())
}
```

```rust
use calamine::{Workbook, Xlsx, XlsxError};

fn main() -> Result<(), XlsxError> {
    let workbook: Xlsx = Workbook::open("WORKBOOK.xlsx")?;
    let worksheet: Worksheet<Lazy> = workbook.worksheet_lazy("Sheet1")?;
    while let Some(cell) = worksheet.cells() {
        // work
    }
    while let Some(row) = worksheet.rows() {
        // work
    }
    Ok(())
}
```

We would need test coverage of both modes.
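The type-state idea sketched above can be shown compiling end to end; everything here (`Memory`, `Lazy`, the method names, the in-memory row storage) is illustrative, not calamine's API:

```rust
// Type-state sketch: zero-sized marker types select which methods exist,
// so "random access" and "streaming" are checked at compile time.
use std::marker::PhantomData;

struct Memory;
struct Lazy;

struct Worksheet<Mode> {
    rows: Vec<Vec<String>>, // stand-in for parsed sheet data
    _mode: PhantomData<Mode>,
}

impl Worksheet<Memory> {
    // Random access only makes sense once everything is in memory.
    fn cell(&self, row: usize, col: usize) -> Option<&str> {
        self.rows.get(row)?.get(col).map(String::as_str)
    }
}

impl Worksheet<Lazy> {
    // The lazy mode exposes only forward iteration; sketched here by
    // draining pre-loaded rows to keep the example self-contained.
    fn next_row(&mut self) -> Option<Vec<String>> {
        if self.rows.is_empty() {
            None
        } else {
            Some(self.rows.remove(0))
        }
    }
}

fn main() {
    let ws = Worksheet::<Memory> {
        rows: vec![vec!["a".into(), "b".into()]],
        _mode: PhantomData,
    };
    assert_eq!(ws.cell(0, 1), Some("b"));

    let mut lazy = Worksheet::<Lazy> {
        rows: vec![vec!["a".into()]],
        _mode: PhantomData,
    };
    assert!(lazy.next_row().is_some());
    assert!(lazy.next_row().is_none());
    println!("ok");
}
```

Calling `cell` on a `Worksheet<Lazy>` (or `next_row` on a `Worksheet<Memory>`) is a compile error, which is the main selling point of the pattern over a runtime mode flag.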
I think we should keep changes only at the sheet level. The current code already provides both lazy and full APIs.
Next steps:
Considering the domain, as well as the specification, it seems a lot of gains could be made from using something like smartstring in place of `String`. It could also replace some of the `Vec<u8>` keys in `BTreeMap`s, as the inlined string has better cache locality. And even if it does need to be heap-allocated, it's no worse than a `Vec`.