Collaborator

@XanthosXanthopoulos XanthosXanthopoulos commented Nov 13, 2025

Issue and/or context: SOMA-528 SOMA-714 SOMA-688

Changes:
This PR changes the memory management for the read/write operations implemented by ManagedQuery. Specifically:

  • Replaces std::vector-backed buffers with C++ arrays wrapped in std::unique_ptr
  • Optimizes the null count calculation for nullable columns
  • Makes the conversion of TileDB buffers to Arrow tables multithreaded
  • Makes writes zero-copy when possible. Passing a temporary object as the data buffer for a write will crash the program, because setting the buffers and writing them to TileDB is not an atomic operation: the temporary can be destroyed before the write is submitted
  • Removes implicit casting of numeric data when writing to TileDB. When passing data to read/write, use the schema provided by SOMAArray to cast types in advance
  • Fixes index casting when writing dictionaries
  • Properly writes validity buffers for nullable enumerated columns
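
The description does not say how the null count is computed. As a rough illustration only, and assuming an Arrow-style validity bitmap (one bit per value, least-significant bit first, 1 meaning valid), nulls can be counted a byte at a time instead of a bit at a time. The function name `null_count` is hypothetical and is not part of libtiledbsoma.

```python
def null_count(validity: bytes, length: int) -> int:
    """Count nulls among the first `length` bits of an Arrow-style bitmap.

    Assumption: bit i of the bitmap lives in byte i // 8 at position i % 8
    (least-significant bit first), and a set bit means "valid".
    """
    full_bytes, tail_bits = divmod(length, 8)

    # Population count of every fully used byte.
    valid = sum(bin(b).count("1") for b in validity[:full_bytes])

    # Only the low `tail_bits` bits of the last byte belong to the column.
    if tail_bits:
        mask = (1 << tail_bits) - 1
        valid += bin(validity[full_bytes] & mask).count("1")

    return length - valid
```

For example, a 10-value column whose second value is null has the bitmap bytes `0b11111101, 0b00000011`, and the function reports one null.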

Notes for Reviewer:


codecov bot commented Nov 14, 2025

Codecov Report

❌ Patch coverage is 87.50000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 86.83%. Comparing base (a43bade) to head (1a00005).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4311      +/-   ##
==========================================
- Coverage   86.83%   86.83%   -0.01%     
==========================================
  Files         137      138       +1     
  Lines       20736    20743       +7     
  Branches       15       16       +1     
==========================================
+ Hits        18007    18013       +6     
- Misses       2729     2730       +1     
Flag Coverage Δ
python 89.16% <ø> (-0.03%) ⬇️
r 85.62% <87.50%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Components Coverage Δ
python_api 89.16% <ø> (-0.03%) ⬇️
libtiledbsoma 76.77% <75.00%> (-0.47%) ⬇️

@XanthosXanthopoulos XanthosXanthopoulos marked this pull request as ready for review November 14, 2025 17:03
Member

@bkmartinjr bkmartinjr left a comment

are there any measurements of the time/space impact of this PR?

@XanthosXanthopoulos
Collaborator Author

are there any measurements of the time/space impact of this PR?

I have ingested a couple of h5ad files and the result was about 20% faster with this PR with lower memory usage as well

Comment on lines 704 to 723
} else {
    // Casting is needed and casted data ownership should pass to the column

    std::unique_ptr<std::byte[]> data_buffer = std::make_unique_for_overwrite<std::byte[]>(
        array->length * sizeof(DiskType));

    std::span<UserType> original_data_buffer_view(buf, array->length);
    std::span<DiskType> data_buffer_view(reinterpret_cast<DiskType*>(data_buffer.get()), array->length);

    for (int64_t i = 0; i < array->length; ++i) {
        data_buffer_view[i] = static_cast<DiskType>(original_data_buffer_view[i]);
    }

    setup_write_column(
        schema->name,
        array->length,
        std::move(data_buffer),
        (uint64_t*)nullptr,
        _cast_validity_buffer_ptr(array));
}
Collaborator

This should just throw. Since arrow/pyarrow/nanoarrow have an explicit schema, we should just require the user to pass us data with the appropriate types. The only case where we want to auto-cast is when the user provides a dictionary whose values match the on-disk data type of a non-enumerated column.

Collaborator

This change may break existing unit tests in Python/R and will definitely require new unit tests.

Collaborator Author

Since this change will require non-trivial changes and may break current workflows, should it be in a separate PR?
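
The stricter behaviour proposed in this thread (reject mismatched types, and only auto-convert a dictionary whose values already match the on-disk type of a non-enumerated column) can be sketched roughly as follows. Everything here is hypothetical: `UserArray`, `plan_write`, and the returned labels are illustrative names, types are plain strings, and the real check would live in the C++ write path.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UserArray:
    """Hypothetical summary of user-supplied Arrow data."""

    value_type: str  # array type, or index type for a dictionary array
    dictionary_value_type: Optional[str] = None  # set only for dictionary arrays


def plan_write(disk_type: str, data: UserArray) -> str:
    """Decide how to write `data` to a non-enumerated column of `disk_type`.

    Raises TypeError instead of casting when the types do not match.
    """
    if data.dictionary_value_type is None:
        # Plain array: the type must already match the on-disk type.
        if data.value_type == disk_type:
            return "write_as_is"
        raise TypeError(
            f"expected {disk_type}, got {data.value_type}; cast before writing"
        )

    # Dictionary array: allowed only if its values match the on-disk type.
    if data.dictionary_value_type == disk_type:
        return "decode_dictionary"
    raise TypeError(
        f"dictionary values are {data.dictionary_value_type}, expected {disk_type}"
    )
```

With this sketch, `plan_write("float32", UserArray("float64"))` raises, while `plan_write("string", UserArray("int8", "string"))` is accepted as the one auto-conversion case.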


CArrayColumnBuffer() = delete;
CArrayColumnBuffer(const CArrayColumnBuffer&) = delete;
CArrayColumnBuffer(CArrayColumnBuffer&&) = default;
Collaborator

I get the following warning when compiling:

/home/jules/Software/TileDB-Inc/TileDB-SOMA/libtiledbsoma/src/soma/column_buffer.h:401:5: warning: explicitly defaulted move constructor is implicitly deleted [-Wdefaulted-function-deleted]
  401 |     CArrayColumnBuffer(CArrayColumnBuffer&&) = default;
      |     ^
/home/jules/Software/TileDB-Inc/TileDB-SOMA/libtiledbsoma/src/soma/column_buffer.h:375:28: note: move constructor of 'CArrayColumnBuffer' is implicitly deleted because base class 'ReadColumnBuffer' has a deleted move constructor

@XanthosXanthopoulos XanthosXanthopoulos changed the title from "[c++][WIP] Optimize memory management on write path" to "[c++] Optimize memory management on write path" Nov 25, 2025
…rrow objects (#4334)

* Add custom allocator for vector backed buffers, implement memory modes for TileDB to Arrow conversions

* Lint fix

* Migrate R to use multithreaded arrow conversion

* Fix compiler warnings

* Rebind buffers after each read operation

* Read memory mode from config

* Include Rcpp header before other headers

* Store object references when setting column data until write is submitted (#4350)
@jp-dark jp-dark removed the request for review from bkmartinjr December 22, 2025 13:54
self._handle.submit_write()

# clear stored data objects
self._ref_store.clear()

Does ManagedQuery allow buffers to be re-used across multiple submit calls?

Collaborator Author

Buffers used for write operations are not reused. The managed query only gets a view of them from wherever they come from.


In the C/C++ API buffers can be re-used for multi-part global order writes. But IIRC those aren't supported by Python yet, right?

Collaborator Author

Yes. The buffers we supply to the C++ API are owned by NumPy or Arrow, so we do nothing with them beyond passing them through.
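
The pattern discussed in this thread (keep the objects that own the buffers alive until the write is submitted, then release them) can be sketched in plain Python. This is a toy model, not the real binding: `Buffer` and `SketchQuery` are hypothetical names, and a weak reference stands in for the non-owning view that the native side holds.

```python
import weakref


class Buffer:
    """Stand-in for an object that owns memory (e.g. a NumPy or Arrow array)."""

    def __init__(self, values):
        self.values = list(values)


class SketchQuery:
    """Hypothetical query wrapper that only sees non-owning views of the data."""

    def __init__(self):
        self._ref_store = []  # strong references keep the owners alive
        self._views = {}  # non-owning views, as the native side would hold

    def set_column_data(self, name, owner):
        self._ref_store.append(owner)
        self._views[name] = weakref.ref(owner)

    def submit_write(self):
        written = {}
        for name, view in self._views.items():
            owner = view()
            if owner is None:
                raise RuntimeError(f"buffer for {name!r} was freed before the write")
            written[name] = list(owner.values)
        # The write is done, so the stored references can be released.
        self._views.clear()
        self._ref_store.clear()
        return written
```

Because `set_column_data` stores a strong reference, passing a temporary such as `Buffer([1, 2, 3])` is safe in this model; without `_ref_store`, the view would be the only thing pointing at the data when `submit_write` runs.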


rroelke commented Dec 22, 2025

are there any measurements of the time/space impact of this PR?

I have ingested a couple of h5ad files and the result was about 20% faster with this PR with lower memory usage as well

Is this just for the write path? Or does this also include reads? It might be nice to see a breakdown.

@XanthosXanthopoulos
Collaborator Author

Yes, this was just for writes. Reads were slower before the memory-modes PR was merged, but they should now be on par or faster. I haven't run the benchmarks yet.

@rroelke rroelke left a comment

I haven't finished yet, I've made my way through column_buffer.{cc,h} so far.

I have left a slew of comments but nothing particularly regarding safety, most of them are cosmetic, for which I defer to y'all as I am not a SOMA maintainer. These few are the most important:

At a higher level, I don't really understand why CArrayColumnBuffer would have different performance characteristics from VectorColumnBuffer. I would expect them to make very similar patterns of memory allocations as long as the move constructors etc. are used appropriately. Data overrides my intuition, of course. But I have a feeling that you don't actually need to separate these; separating the read and write paths would be enough.

  pydict["bar"] = [4.1, 5.2, 6.3, 7.4, 8.5]
  pydict["baz"] = ["apple", "ball", "cat", "dog", "egg"]
- rb = pa.Table.from_pydict(pydict)
+ rb = pa.Table.from_pydict(pydict, schema=obs_arrow_schema.insert(0, pa.field("soma_joinid", pa.int64())))

How come you are choosing to add the field this way instead of inline above?

  ]
  pydict["soma_joinid"] = [1, 2]
- pydict["quality"] = [4.1, 5.2]
+ pydict["quality"] = pa.array([4.1, 5.2], type=pa.float32())

Can you explain this change? Is this to test a temporary object used in ManagedQuery?

Collaborator Author

By default the data type would be inferred as float64, and this would fail in ManagedQuery because we removed implicit casting.


if (is_var()) {
    std::unique_ptr<uint64_t[]> offsets_buffer = std::make_unique_for_overwrite<uint64_t[]>(num_cells_ + 1);
    std::memcpy(offsets_buffer.get(), this->offsets().data(), (num_cells_ + 1) * sizeof(uint64_t));

I wonder how many times a developer has written this TileDB-to-arrow buffer conversion. It's definitely a good piece of technical debt that core doesn't produce arrow, when so many upstream users want it.
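
For readers unfamiliar with the layout being copied above: Arrow's variable-size types store `n + 1` offsets for `n` values, with value `i` spanning `offsets[i]` to `offsets[i + 1]`, which is why the snippet copies `num_cells_ + 1` entries. A minimal sketch of reading such a buffer follows; it assumes byte offsets, and `split_var_cells` is a hypothetical helper name.

```python
def split_var_cells(data, offsets):
    """Split a contiguous data buffer into cells using n + 1 byte offsets.

    Assumption: `offsets` is non-decreasing and its last entry equals the
    number of used bytes in `data`.
    """
    if not offsets:
        return []
    return [data[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]
```

For instance, the data `b"appleballcat"` with offsets `[0, 5, 9, 12]` yields three cells: `b"apple"`, `b"ball"`, and `b"cat"`.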

}
} else {
    if (type() != TILEDB_BOOL) {
        data_buffer = std::make_unique_for_overwrite<std::byte[]>(data_size_);

Do you have any evidence that this mode is ever better?
The above approach conceptually allocates more memory - twice the max data size. Whereas here you would end up with a potentially smaller number of variably-sized allocations. And keep in mind if this is a read query it's probably filling the buffer with as many cells as you have room for - whatever piece is left is going to be small. In practice the allocator might give you the same size blocks anyway.
Instead of varying by mode you may want to make the choice based on the ratio of data_size_ to max_data_size_

Collaborator Author

Choosing based on the ratio of data_size_ to max_data_size_ is a much better approach and is planned for a follow-up PR. By default, 1 GB is allocated per column, which is inefficient for small reads because the buffers have the same lifetime as the resulting Arrow table. These over-allocations are usually only virtual allocations, but there have been random memory errors that may be related.
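
The ratio-based choice suggested in this thread might look something like the sketch below. It is only an illustration of the idea: the function name and the 0.5 threshold are assumptions, and the follow-up PR may choose differently.

```python
def should_copy_to_exact_size(data_size, max_data_size, threshold=0.5):
    """Return True if the read result should be copied into a right-sized buffer.

    Copying costs time but frees the large read buffer; handing the buffer
    over avoids the copy but keeps `max_data_size` bytes alive for as long
    as the resulting Arrow table. The threshold is a hypothetical tuning knob.
    """
    if max_data_size <= 0:
        raise ValueError("max_data_size must be positive")
    if not 0 <= data_size <= max_data_size:
        raise ValueError("data_size must be between 0 and max_data_size")
    return data_size / max_data_size < threshold
```

Under these assumptions, a 1 KiB result in a 1 GiB buffer would be copied, while a buffer that is 90% full would be handed over as is.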

return std::span<uint8_t>(validity_.data(), num_cells_);
}

std::unique_ptr<IArrowBufferStorage> VectorColumnBuffer::export_buffers() {

This is almost entirely the same as the CBuffer one, I suggest using a helper function

Collaborator Author

VectorColumnBuffer will be removed entirely once CArrayColumnBuffer has matured.

jp-dark added a commit that referenced this pull request Dec 23, 2025
Add test for dictionary casting from
#4311

Co-authored-by:  XanthosXanthopoulos <[email protected]>
jp-dark added a commit that referenced this pull request Jan 8, 2026
…4359)

* (WIP) Safe-cast pyarrow tables on write

Still needs the following:
* fix schema names for GeometryDataFrame
* test unsafe casting

* Fix casting for geometry dataframe outlines

* Update history

* Switch from deprecated `field_by_name` to `field`

* Update error message and remove unneeded type declaration

* Take tests from PR #4311

Add test for dictionary casting from
#4311

Co-authored-by:  XanthosXanthopoulos <[email protected]>

* Add xfail to uncovered bug

* Remove test that is checking for unsafe cast

* Fix syntax for xfail

---------

Co-authored-by: XanthosXanthopoulos <[email protected]>
jp-dark added a commit that referenced this pull request Jan 12, 2026
…4359)

* (WIP) Safe-cast pyarrow tables on write

Still needs the following:
* fix schema names for GeometryDataFrame
* test unsafe casting

* Fix casting for geometry dataframe outlines

* Update history

* Switch from deprecated `field_by_name` to `field`

* Update error message and remove unneeded type declaration

* Take tests from PR #4311

Add test for dictionary casting from
#4311

Co-authored-by:  XanthosXanthopoulos <[email protected]>

* Add xfail to uncovered bug

* Remove test that is checking for unsafe cast

* Fix syntax for xfail

---------

Co-authored-by: XanthosXanthopoulos <[email protected]>
(cherry picked from commit a1f6a68)
jp-dark added a commit that referenced this pull request Jan 13, 2026
…4359) (#4369)

* (WIP) Safe-cast pyarrow tables on write

Still needs the following:
* fix schema names for GeometryDataFrame
* test unsafe casting

* Fix casting for geometry dataframe outlines

* Update history

* Switch from deprecated `field_by_name` to `field`

* Update error message and remove unneeded type declaration

* Take tests from PR #4311

Add test for dictionary casting from
#4311



* Add xfail to uncovered bug

* Remove test that is checking for unsafe cast

* Fix syntax for xfail

---------


(cherry picked from commit a1f6a68)

Co-authored-by: XanthosXanthopoulos <[email protected]>
5 participants