[c++] Optimize memory management on write path #4311

XanthosXanthopoulos · 2025-11-13T17:55:46Z

Issue and/or context: SOMA-528 SOMA-714 SOMA-688

Changes:
This PR changes the memory management for the read/write operations implemented by ManagedQuery. Specifically:

Replaces std::vector backed buffers with C++ arrays wrapped in std::unique_ptr
Optimizes null count calculation for nullable columns
Makes TileDB buffers to Arrow table conversion multithreaded
When possible, writes are now zero-copy. Passing a temporary object as the data buffer for a write will crash the program, because setting the buffers and writing them to TileDB is not a single atomic operation, so the buffer must stay alive until the write is submitted
Removes implicit casting of numeric data when writing to TileDB. When passing data to read/write, cast it in advance to the types in the schema provided by the SOMAArray
Fixes index casting when writing dictionaries
Properly write validity buffers for nullable enumerated columns
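
As an illustration of the null count item above, here is a minimal sketch of counting nulls by popcounting a validity bitmap one byte at a time instead of testing each bit. It assumes an Arrow-style bitmap (bit i set means element i is valid, least-significant bit first) with a zero offset; `count_nulls` is a hypothetical name and this is not necessarily how the PR implements it.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of nulls among the first `length` elements of
// an Arrow-style validity bitmap. A null bitmap pointer means "no nulls".
inline std::int64_t count_nulls(const std::uint8_t* validity, std::int64_t length) {
    assert(length >= 0);
    if (validity == nullptr || length == 0) {
        return 0;
    }

    // Count set (valid) bits in every fully used byte.
    const std::int64_t full_bytes = length / 8;
    std::int64_t valid = 0;
    for (std::int64_t i = 0; i < full_bytes; ++i) {
        valid += static_cast<std::int64_t>(std::bitset<8>(validity[i]).count());
    }

    // The last byte may be only partially used: mask off the padding bits.
    const int tail_bits = static_cast<int>(length % 8);
    if (tail_bits != 0) {
        const unsigned mask = (1u << tail_bits) - 1u;
        valid += static_cast<std::int64_t>(
            std::bitset<8>(validity[full_bytes] & mask).count());
    }
    return length - valid;
}
```

For a bitmap whose first byte is `0x0D`, the first four elements have three valid bits, so `count_nulls(bitmap, 4)` reports one null.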

Notes for Reviewer:

codecov · 2025-11-14T10:39:30Z

Codecov Report

❌ Patch coverage is 87.50000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 86.83%. Comparing base (a43bade) to head (1a00005).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4311      +/-   ##
==========================================
- Coverage   86.83%   86.83%   -0.01%     
==========================================
  Files         137      138       +1     
  Lines       20736    20743       +7     
  Branches       15       16       +1     
==========================================
+ Hits        18007    18013       +6     
- Misses       2729     2730       +1

Flag	Coverage Δ
python	`89.16% <ø> (-0.03%)`	⬇️
r	`85.62% <87.50%> (+0.01%)`	⬆️

Components	Coverage Δ
python_api	`89.16% <ø> (-0.03%)`	⬇️
libtiledbsoma	`76.77% <75.00%> (-0.47%)`	⬇️

bkmartinjr

are there any measurements of the time/space impact of this PR?

XanthosXanthopoulos · 2025-11-16T18:44:16Z

> are there any measurements of the time/space impact of this PR?

I have ingested a couple of h5ad files and the result was about 20% faster with this PR with lower memory usage as well

jp-dark · 2025-11-18T17:24:41Z

libtiledbsoma/src/soma/managed_query.h

+            } else {
+                // Casting is needed and casted data ownership should pass to the column
+
+                std::unique_ptr<std::byte[]> data_buffer = std::make_unique_for_overwrite<std::byte[]>(
+                    array->length * sizeof(DiskType));
+
+                std::span<UserType> original_data_buffer_view(buf, array->length);
+                std::span<DiskType> data_buffer_view(reinterpret_cast<DiskType*>(data_buffer.get()), array->length);
+
+                for (int64_t i = 0; i < array->length; ++i) {
+                    data_buffer_view[i] = static_cast<DiskType>(original_data_buffer_view[i]);
+                }
+
+                setup_write_column(
+                    schema->name,
+                    array->length,
+                    std::move(data_buffer),
+                    (uint64_t*)nullptr,
+                    _cast_validity_buffer_ptr(array));
+            }


This should just throw. Since arrow/pyarrow/nanoarrow have an explicit schema, we should just require the user pass us data with appropriate types. The only case we want to auto-cast is when the user provides us with a dictionary where the values match the on disk data type of a non-enumerated column.

This change may break existing unit tests in Python/R and will definitely require new unit tests.

Since this will require non-trivial changes and may break current workflows, should it be in a separate PR?
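
The "just throw" suggestion above could look roughly like the sketch below: compare the caller's element type with the on-disk type and raise an error instead of allocating a converted copy. `require_disk_type` is a hypothetical helper, not part of `ManagedQuery`, and the dictionary special case mentioned in the review is deliberately left out.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <type_traits>

// Hypothetical helper: returns the caller's buffer unchanged when its element
// type already matches the on-disk type, and throws otherwise. No cast and no
// copy is ever made on behalf of the caller.
template <typename DiskType, typename UserType>
const DiskType* require_disk_type(const UserType* data, const std::string& column) {
    assert(data != nullptr);
    if (!std::is_same<UserType, DiskType>::value) {
        throw std::invalid_argument(
            "Column '" + column +
            "': data type does not match the array schema; cast the data "
            "before writing");
    }
    // Only reached when UserType and DiskType are the same type.
    return static_cast<const DiskType*>(static_cast<const void*>(data));
}
```

With `float` data, `require_disk_type<float>(values, "quality")` returns the original pointer, while `require_disk_type<double>(values, "quality")` throws `std::invalid_argument`.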

jp-dark · 2025-11-18T17:38:19Z

libtiledbsoma/src/soma/column_buffer.h

+
+    CArrayColumnBuffer() = delete;
+    CArrayColumnBuffer(const CArrayColumnBuffer&) = delete;
+    CArrayColumnBuffer(CArrayColumnBuffer&&) = default;


I get the following warning when compiling:

/home/jules/Software/TileDB-Inc/TileDB-SOMA/libtiledbsoma/src/soma/column_buffer.h:401:5: warning: explicitly defaulted move constructor is implicitly deleted [-Wdefaulted-function-deleted]
  401 |     CArrayColumnBuffer(CArrayColumnBuffer&&) = default;
      |     ^
/home/jules/Software/TileDB-Inc/TileDB-SOMA/libtiledbsoma/src/soma/column_buffer.h:375:28: note: move constructor of 'CArrayColumnBuffer' is implicitly deleted because base class 'ReadColumnBuffer' has a deleted move constructor
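
For context on the warning above, a minimal sketch of the two usual ways out, using hypothetical stand-in types rather than the real `ReadColumnBuffer` and `CArrayColumnBuffer`: either make the base class movable so that `= default` in the derived class does what it says, or spell out `= delete` in the derived class so the declaration matches what the compiler would do anyway.

```cpp
#include <type_traits>

// Option 1: the base declares a defaulted move constructor, so the derived
// class can default its own.
struct MovableBase {
    MovableBase() = default;
    MovableBase(const MovableBase&) = delete;
    MovableBase(MovableBase&&) = default;
};

struct MovableDerived : MovableBase {
    MovableDerived() = default;
    MovableDerived(const MovableDerived&) = delete;
    MovableDerived(MovableDerived&&) = default;  // a real move constructor
};

// Option 2: the base stays non-movable. Writing `= default` on the derived
// move constructor would be defined as deleted and would trigger
// -Wdefaulted-function-deleted, so the intent is stated explicitly instead.
struct PinnedBase {
    PinnedBase() = default;
    PinnedBase(const PinnedBase&) = delete;
    PinnedBase(PinnedBase&&) = delete;
};

struct PinnedDerived : PinnedBase {
    PinnedDerived() = default;
    PinnedDerived(const PinnedDerived&) = delete;
    PinnedDerived(PinnedDerived&&) = delete;  // explicit, no warning
};

static_assert(std::is_move_constructible<MovableDerived>::value,
              "derived type is movable once the base is movable");
static_assert(!std::is_move_constructible<PinnedDerived>::value,
              "derived type cannot be moved while the base forbids it");
```

Which option fits depends on whether the column buffer is ever meant to be moved after construction; the sketch does not settle that question.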

libtiledbsoma/src/soma/column_buffer.h

…rrow objects (#4334)
* Add custom allocator for vector backed buffers, implement memory modes for TileDB to Arrow conversions
* Lint fix
* Migrate R to use multithreaded arrow conversion
* Fix compiler warnings
* Rebind buffers after each read operation
* Read memory mode from config
* Include Rcpp header before other headers
* Store object references when setting column data until write is submitted (#4350)

rroelke · 2025-12-22T14:35:58Z

apis/python/src/tiledbsoma/_managed_query.py

+        self._handle.submit_write()
+
+        # clear stored data objects
+        self._ref_store.clear()


Does ManagedQuery allow buffers to be re-used across multiple submit calls?

Buffers used for write operations are not reused. The managed query only gets a view of them from wherever they come from.

In the C/C++ API buffers can be re-used for multi-part global order writes. But IIRC those aren't supported by Python yet, right?

Yes. The buffers we supply to the C++ API are owned by NumPy or Arrow, so we do nothing with them beyond passing a view.
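
The pattern discussed in this thread (the query holds only a view, and the caller's object is kept alive until the write is submitted) can be sketched in C++ as follows. `WriteQueue`, `set_column`, and `submit` are hypothetical names for illustration; the real `ManagedQuery` API and the Python `_ref_store` may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical write queue: stores non-owning views of column data together
// with handles that keep the owning objects alive until submit() runs.
class WriteQueue {
    struct View {
        const void* data;
        std::size_t nbytes;
    };

   public:
    // Registers a zero-copy view. `owner` is any shared handle whose lifetime
    // covers `data`; holding it prevents the buffer from being freed early.
    void set_column(const void* data, std::size_t nbytes,
                    std::shared_ptr<const void> owner) {
        assert(data != nullptr || nbytes == 0);
        views_.push_back(View{data, nbytes});
        owners_.push_back(std::move(owner));
    }

    // Stand-in for the real write: reports the bytes that would be written,
    // then drops the views and releases the stored owners.
    std::size_t submit() {
        std::size_t total = 0;
        for (const View& view : views_) {
            total += view.nbytes;
        }
        views_.clear();
        owners_.clear();
        return total;
    }

   private:
    std::vector<View> views_;
    std::vector<std::shared_ptr<const void>> owners_;
};
```

A caller can drop its own handle right after `set_column`; the data stays valid until `submit()` returns, which mirrors clearing the reference store after `submit_write()`.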

rroelke · 2025-12-22T15:21:57Z

> are there any measurements of the time/space impact of this PR?

> I have ingested a couple of h5ad files and the result was about 20% faster with this PR with lower memory usage as well

Is this just for the write path? Or does this also include reads? It might be nice to see a breakdown.

XanthosXanthopoulos · 2025-12-22T15:25:11Z

Yes, this was just for writes. Reads were slower before the memory modes PR was merged; with it, they should be on par or faster. I haven't run the benchmarks yet.

rroelke

I haven't finished yet, I've made my way through column_buffer.{cc,h} so far.

I have left a slew of comments but nothing particularly regarding safety, most of them are cosmetic, for which I defer to y'all as I am not a SOMA maintainer. These few are the most important:

At a higher level I don't really understand why the CArrayColumnBuffer would have any different performance characteristics than the VectorColumnBuffer. I would expect the two to make very similar patterns of memory allocations as long as move constructors and the like are used appropriately. Data overrides my intuition, of course. But I have a feeling that you don't actually need to separate these: separating the read and write paths would be enough.
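
One place where the two buffer types can differ, offered as an assumption rather than a measured result for this PR: `std::vector<T>(n)` value-initializes (zero-fills) its elements, while `std::make_unique_for_overwrite<T[]>(n)` leaves them uninitialized. A vector with a default-initializing allocator removes that difference, as in the sketch below. The names `default_init_allocator`, `RawBuffer`, and `make_raw_buffer` are hypothetical and are not taken from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <new>
#include <type_traits>
#include <utility>
#include <vector>

// Allocator adaptor: construct() with no arguments default-initializes the
// element, so trivial types such as std::uint8_t are left uninitialized
// instead of being zero-filled by the container.
template <typename T, typename A = std::allocator<T>>
class default_init_allocator : public A {
    using traits = std::allocator_traits<A>;

   public:
    template <typename U>
    struct rebind {
        using other =
            default_init_allocator<U, typename traits::template rebind_alloc<U>>;
    };

    using A::A;  // inherit the underlying allocator's constructors

    // Used when the container asks for value-initialization: `new U` without
    // parentheses performs default-initialization.
    template <typename U>
    void construct(U* ptr) noexcept(
        std::is_nothrow_default_constructible<U>::value) {
        ::new (static_cast<void*>(ptr)) U;
    }

    // Every other construction is forwarded to the underlying allocator.
    template <typename U, typename... Args>
    void construct(U* ptr, Args&&... args) {
        traits::construct(static_cast<A&>(*this), ptr,
                          std::forward<Args>(args)...);
    }
};

using RawBuffer = std::vector<std::uint8_t, default_init_allocator<std::uint8_t>>;

// Allocates `nbytes` of storage without paying for a zero-fill.
inline RawBuffer make_raw_buffer(std::size_t nbytes) {
    RawBuffer buffer(nbytes);
    assert(buffer.size() == nbytes);
    return buffer;
}
```

Contents of a freshly made `RawBuffer` are indeterminate until written, which is the same contract as a `make_unique_for_overwrite` array; whether this accounts for any measured difference would need a benchmark.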

apis/python/tests/test_dataframe.py

rroelke · 2025-12-22T14:40:40Z

apis/python/tests/test_experiment_basic.py

    pydict["bar"] = [4.1, 5.2, 6.3, 7.4, 8.5]
    pydict["baz"] = ["apple", "ball", "cat", "dog", "egg"]
-    rb = pa.Table.from_pydict(pydict)
+    rb = pa.Table.from_pydict(pydict, schema=obs_arrow_schema.insert(0, pa.field("soma_joinid", pa.int64())))


How come you are choosing to add the field this way instead of inline above?

rroelke · 2025-12-22T14:41:43Z

apis/python/tests/test_geometry_dataframe.py

        ]
        pydict["soma_joinid"] = [1, 2]
-        pydict["quality"] = [4.1, 5.2]
+        pydict["quality"] = pa.array([4.1, 5.2], type=pa.float32())


Can you explain this change? Is this to test a temporary object used in ManagedQuery?

By default the datatype would be inferred as float64, and this would fail in ManagedQuery because we removed implicit casting.

apis/python/tests/test_sparse_nd_array.py

libtiledbsoma/src/soma/array_buffers.h