[c++] Optimize memory management on write path #4311
base: main
Conversation
Codecov Report

❌ Patch coverage is

```
@@            Coverage Diff             @@
##              main    #4311      +/-   ##
==========================================
- Coverage    86.83%   86.83%   -0.01%
==========================================
  Files          137      138       +1
  Lines        20736    20743       +7
  Branches        15       16       +1
==========================================
+ Hits         18007    18013       +6
- Misses        2729     2730       +1
```

Flags with carried forward coverage won't be shown.
bkmartinjr left a comment:
Are there any measurements of the time/space impact of this PR?
I have ingested a couple of h5ad files; the result was about 20% faster with this PR, with lower memory usage as well.
```cpp
} else {
    // Casting is needed and casted data ownership should pass to the column

    std::unique_ptr<std::byte[]> data_buffer = std::make_unique_for_overwrite<std::byte[]>(
        array->length * sizeof(DiskType));

    std::span<UserType> original_data_buffer_view(buf, array->length);
    std::span<DiskType> data_buffer_view(reinterpret_cast<DiskType*>(data_buffer.get()), array->length);

    for (int64_t i = 0; i < array->length; ++i) {
        data_buffer_view[i] = static_cast<DiskType>(original_data_buffer_view[i]);
    }

    setup_write_column(
        schema->name,
        array->length,
        std::move(data_buffer),
        (uint64_t*)nullptr,
        _cast_validity_buffer_ptr(array));
}
```
This should just throw. Since arrow/pyarrow/nanoarrow have an explicit schema, we should just require the user pass us data with appropriate types. The only case we want to auto-cast is when the user provides us with a dictionary where the values match the on disk data type of a non-enumerated column.
This change may break existing unit tests in Python/R and will definitely require new unit tests.
Since this will require non-trivial changes and may break current workflows, should it be in a separate PR?
```cpp
CArrayColumnBuffer() = delete;
CArrayColumnBuffer(const CArrayColumnBuffer&) = delete;
CArrayColumnBuffer(CArrayColumnBuffer&&) = default;
```
I get the following warning when compiling:
```
/home/jules/Software/TileDB-Inc/TileDB-SOMA/libtiledbsoma/src/soma/column_buffer.h:401:5: warning: explicitly defaulted move constructor is implicitly deleted [-Wdefaulted-function-deleted]
  401 |     CArrayColumnBuffer(CArrayColumnBuffer&&) = default;
      |     ^
/home/jules/Software/TileDB-Inc/TileDB-SOMA/libtiledbsoma/src/soma/column_buffer.h:375:28: note: move constructor of 'CArrayColumnBuffer' is implicitly deleted because base class 'ReadColumnBuffer' has a deleted move constructor
```
Force-pushed from 14d71e2 to f2da391
Force-pushed from 1ac1e25 to 1a00005
…rrow objects (#4334)

* Add custom allocator for vector backed buffers, implement memory modes for TileDB to Arrow conversions
* Lint fix
* Migrate R to use multithreaded arrow conversion
* Fix compiler warnings
* Rebind buffers after each read operation
* Read memory mode from config
* Include Rcpp header before other headers

Store object references when setting column data until write is submitted (#4350)
```python
self._handle.submit_write()

# clear stored data objects
self._ref_store.clear()
```
Does ManagedQuery allow buffers to be re-used across multiple submit calls?
Buffers used for write operations are not reused. The managed query only gets a view of them from wherever they come from.
In the C/C++ API buffers can be re-used for multi-part global order writes. But IIRC those aren't supported by Python yet, right?
Yes, the buffers we supply to the C++ API are owned by numpy or Arrow, so we do not do anything with them beyond passing them through.
Is this just for the write path, or does this also include reads? It might be nice to see a breakdown.

Yes, this was just for writes. Reads were slower before merging the different memory modes PR, but after it they should be on par or faster. I haven't run the benchmarks yet.
I haven't finished yet, I've made my way through column_buffer.{cc,h} so far.
I have left a slew of comments, but nothing particularly regarding safety; most of them are cosmetic, for which I defer to y'all as I am not a SOMA maintainer. These few are the most important:
At a higher level I don't really understand why the CArrayColumnBuffer would have any different performance characteristics than the VectorColumnBuffer; I would expect these to make very similar patterns of memory allocations as long as the move constructors etc. are used appropriately. Data overrides my intuition, of course. But I have a feeling that you don't actually need to separate these: separating the read and write paths would be enough.
```diff
  pydict["bar"] = [4.1, 5.2, 6.3, 7.4, 8.5]
  pydict["baz"] = ["apple", "ball", "cat", "dog", "egg"]
- rb = pa.Table.from_pydict(pydict)
+ rb = pa.Table.from_pydict(pydict, schema=obs_arrow_schema.insert(0, pa.field("soma_joinid", pa.int64())))
```
How come you are choosing to add the field this way instead of inline above?
```diff
  ]
  pydict["soma_joinid"] = [1, 2]
- pydict["quality"] = [4.1, 5.2]
+ pydict["quality"] = pa.array([4.1, 5.2], type=pa.float32())
```
Can you explain this change? Is this to test a temporary object used in ManagedQuery?
By default the datatype would be inferred as float64 (double), and this would fail in ManagedQuery as we removed implicit castings.
```cpp
if (is_var()) {
    std::unique_ptr<uint64_t[]> offsets_buffer = std::make_unique_for_overwrite<uint64_t[]>(num_cells_ + 1);
    std::memcpy(offsets_buffer.get(), this->offsets().data(), (num_cells_ + 1) * sizeof(uint64_t));
```
I wonder how many times a developer has written this TileDB-to-arrow buffer conversion. It's definitely a good piece of technical debt that core doesn't produce arrow, when so many upstream users want it.
```cpp
    }
} else {
    if (type() != TILEDB_BOOL) {
        data_buffer = std::make_unique_for_overwrite<std::byte[]>(data_size_);
```
Do you have any evidence that this mode is ever better?
The above approach conceptually allocates more memory - twice the max data size. Whereas here you would end up with a potentially smaller number of variably-sized allocations. And keep in mind if this is a read query it's probably filling the buffer with as many cells as you have room for - whatever piece is left is going to be small. In practice the allocator might give you the same size blocks anyway.
Instead of varying by mode you may want to make the choice based on the ratio of data_size_ to max_data_size_
The data_size_ to max_data_size_ ratio is a much better approach and is planned for a follow-up PR. By default the allocated memory per column is 1GB, which is inefficient for small reads since the buffers have the same lifetime as the resulting Arrow table. These overallocations are most of the time only virtual allocations, but there have been random memory errors that may be related.
```cpp
    return std::span<uint8_t>(validity_.data(), num_cells_);
}

std::unique_ptr<IArrowBufferStorage> VectorColumnBuffer::export_buffers() {
```
This is almost entirely the same as the CArrayColumnBuffer one; I suggest using a helper function.
VectorColumnBuffer will be removed entirely once CArrayColumnBuffer has matured.
Add test for dictionary casting from #4311
Co-authored-by: XanthosXanthopoulos <[email protected]>

…4359)

* (WIP) Safe-cast pyarrow tables on write. Still needs the following: fix schema names for GeometryDataFrame; test unsafe casting
* Fix casting for geometry dataframe outlines
* Update history
* Switch from deprecated `field_by_name` to `field`
* Update error message and remove unneeded type declaration
* Take tests from PR #4311 (add test for dictionary casting from #4311)
* Add xfail to uncovered bug
* Remove test that is checking for unsafe cast
* Fix syntax for xfail

Co-authored-by: XanthosXanthopoulos <[email protected]>
(The same change was cherry-picked from commit a1f6a68 as #4369.)
Issue and/or context: SOMA-528 SOMA-714 SOMA-688
Changes:
This PR changes the memory management for the read/write operations implemented by ManagedQuery. Specifically:

* Replace std::vector backed buffers with C++ arrays wrapped in std::unique_ptr
* Use the SOMAArray-provided schema to resolve type casts in advance

Notes for Reviewer: