
Linear search when releasing a sample scales very poorly #2221

Open
gpalmer-latai opened this issue Mar 11, 2024 · 38 comments
Labels
enhancement New feature

Comments

@gpalmer-latai

gpalmer-latai commented Mar 11, 2024

Required information

Operating system:
Ubuntu 20.04 LTS

Compiler version:
GCC 9.4.0

Eclipse iceoryx version:
master branch

Observed result or behaviour:
The time it takes to drain a buffer of subscriber samples increases quadratically. At only 100,000 samples a process running on my desktop will hang for minutes.

See this minimal reproducing example: #2219

The quadratic time complexity of draining a subscriber buffer comes from this linear search in the UsedChunkList:

// go through usedList with stored chunks

Expected result or behaviour:

Releasing a chunk should ideally be an O(1) operation.

I have not yet dug too deeply into the data structure, but I imagine a sample could keep track of its location in the UsedChunkList and remove itself directly. Assuming, however, that the current UsedChunkList is just a forward list, we would need to change it to a doubly-linked list to facilitate O(1) removal.

Increasing the subscriber queue depth beyond the default 256, as well as doing custom buffering to prolong the lifetime of interesting messages in a zero-copy fashion, is a useful and common use case, so accumulating these samples should not have such a drastic effect on performance.

Conditions where it occurred / Performed steps:

A minimal, runnable example is provided here: #2219

Basically all one has to do is increase the MAX_CHUNKS_HELD_PER_SUBSCRIBER_SIMULTANEOUSLY to something large, buffer ~10,000 - 100,000 samples, and then try releasing them all at once. The amount of time it takes to do so grows quadratically. The linked example collects some statistics on overall time, as well as worst case time to release a single message.

At ~10,000 on my desktop I came up with:

Stats from run 3: 

Total time elapsed (ns): 4281156376
Max release time (ns): 944313
Min release time (ns): 0
Avg release time (ns): 428587

As you can see here, the average time it takes to release a sample from a 10,000-sample buffer is ~0.42 ms. Multiply this by 10,000 and it takes about 4.2 seconds to drain the buffer.

If you increase the sample size (no pun intended) to 100,000, you should see an average latency of 4.2 ms which, multiplied by 100,000, results in a total time to drain the buffer of 420 seconds, or 7 minutes.

@gpalmer-latai
Author

gpalmer-latai commented Mar 11, 2024

Took another look. There doesn't seem to be any issue with removing a chunk when we know its index in the UsedChunkList. That would essentially just skip to this line of code:

chunk = m_listData[current].releaseToSharedChunk();

@gpalmer-latai
Author

Yes, so after my first quick pass, I'm thinking we can update the APIs in the following way to achieve the goal of eliminating this linear search (a sketch of the changed signatures follows the list):

  1. For UsedChunkList::insert, instead of returning a bool, return an expected<uint32_t, InsertChunkError>, where the uint32_t is the index into m_listData where the chunk was inserted
  2. In ChunkReceiver::tryGet(), instead of returning expected<const mepoo::ChunkHeader*, ChunkReceiveResult>, return expected<UsedChunk, ChunkReceiveResult>, where UsedChunk is a struct containing a const mepoo::ChunkHeader* and the uint32_t index where that chunk lives in the used chunk list.
  3. This should get propagated out to SubscriberPortUser::tryGetChunk()
  4. And again to BaseSubscriber::takeChunk()
  5. In Subscriber::take(), the deleter around the sample unique pointer should call releaseChunk() with the modified UsedChunk object
  6. SubscriberPortUser::releaseChunk should get updated to take the UsedChunk data structure instead of the chunk header
  7. ChunkReceiver::release gets updated to take UsedChunk instead of the chunk header
  8. UsedChunkList::remove gets updated to take the UsedChunk instead of the chunk header
  9. It then skips the linear search and directly looks up the data in m_listData given the index stored in the UsedChunk struct. To check our logic is sound, we add a contract check validating that m_listData[usedChunk.index].getChunkHeader() == usedChunk.header
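A rough sketch of what steps 1, 2, 8, and 9 could look like (UsedChunk and InsertChunkError are the hypothetical names introduced above; the signatures are illustrative, not the current iceoryx API):

struct UsedChunk
{
    const mepoo::ChunkHeader* header{nullptr};
    uint32_t index{0U}; // slot in m_listData, filled in by insert()
};

class UsedChunkList
{
  public:
    // step 1: was `bool insert(...)`; on success the returned index
    // travels with the sample so remove() can skip the linear search
    expected<uint32_t, InsertChunkError> insert(mepoo::SharedChunk chunk);

    // steps 8/9: direct lookup via usedChunk.index, guarded by a
    // contract check that the slot still holds the expected header
    bool remove(const UsedChunk& usedChunk, mepoo::SharedChunk& chunk);
};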

@elfenpiff
Contributor

@gpalmer-latai This approach could work for the good case. But keep in mind, the UsedChunkList is a lock-free construct, and those often have unpleasant surprises that may take weeks to fix.

But the bad case still has some problems. When an application crashes and someone has to iterate once over 100,000 chunks to clean them up manually, you will see a hiccup in the system where, due to the crash, the runtime suddenly increases. This may be unacceptable for a real-time system and for a safety-critical one (freedom from interference).

The point I am trying to make is that we can certainly apply your solution, but in my opinion it may be solving a problem that should not exist in its current form. I do not doubt the use case, but rather the messaging pattern used (publish-subscribe). Could you tell us a bit more about the setup, especially:

  1. How many subscribers do you have? (I assume just 1 or 2)
  2. How many publishers do you have? (I assume many)
  3. And the overall task? (I assume some kind of logging)

In iceoryx2 we have also planned messaging patterns like pipeline & blackboard, and those could easily be implemented manually in iceoryx since all the right building blocks are already there.

@gpalmer-latai
Author

But keep in mind, the UsedChunkList is a lock-free construct

This isn't relevant as the data structure is not meant to be used concurrently. It is even explicitly NOT thread safe. It exists in shared memory only to allow RouDi to clean up when the subscriber process crashes.

But the bad case still has some problems. When an application crashes and someone has to iterate once over 100,000 chunks to clean them up manually, you will see a hiccup in the system where, due to the crash, the runtime suddenly increases

Not so. While RouDi will not have the index, it doesn't really matter: RouDi doesn't need to release the samples in any particular order, so it can simply iterate forward through the list, releasing everything with O(n) time complexity. For 100,000 samples this is probably just a ~10 ms operation.

The problem described in this issue has to do with random-access time complexity. But the solution is the same as for a std::list - simply allow a version of remove which takes an index/iterator to the element's position, known at insertion time.
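To make the std::list analogy concrete (a plain standard-library illustration, not iceoryx code):

#include <list>

void demo()
{
    std::list<int> chunks;
    // insert() hands back an iterator recording the element's position
    auto it = chunks.insert(chunks.end(), 42);
    // erase(iterator) is O(1); only searching by value would be O(n)
    chunks.erase(it);
}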

@gpalmer-latai
Author

gpalmer-latai commented Mar 12, 2024

As for our setup:

  1. There are just a few subscribers per topic, usually, yes. But there could be more depending on the use case.
  2. The problem described here applies to just a single publisher/subscriber (as I've shown in my example). In general though we might only have one publisher on a topic, or we might have many.
  3. The important information about the task is that we maintain a buffer of recent history, whose size correlates to the overall publish rate of the topic.

As a mitigation we are already exploring batching to reduce publisher sample count, but that can only help to an extent and is not always possible.

It is also worth noting that we can't rely on our access pattern to solve the problem. For example, when we drain samples oldest-first, we hit the worst-case scenario, because the oldest samples are at the back of the used chunk list. You might solve that particular problem by somehow allowing reverse iteration of the list, but we will still have other release patterns touching relatively new and in-the-middle samples.

@elfenpiff
Contributor

elfenpiff commented Mar 12, 2024

@gpalmer-latai

This isn't relevant as the data structure is not meant to be used concurrently. It is even explicitly NOT thread safe. It exists in shared memory only to allow RouDi to clean up when the subscriber process crashes.

I think it is used concurrently. Whenever a publisher delivers a sample, it calls UsedChunkList::insert() in ChunkSender::tryAllocate(), and the subscriber calls UsedChunkList::remove() when releasing the chunk in ChunkReceiver::release(). This is somewhat hidden, but getMembers() in both constructs should return the data of the same underlying construct.

But it is possible that I got this wrong, since I am not so familiar with the source code anymore. @elBoberido I think you are the one who implemented it; maybe you can shed some light on it.

Addendum: I was wrong, the list must just be synced with RouDi.

@gpalmer-latai
Author

gpalmer-latai commented Mar 12, 2024

No worries 🙂

I just took a closer look, and there is a slight complication to the approach I outlined. In true forward-list fashion, list removal unfortunately requires knowing the index of the previous element:

m_listIndices[previous] = m_listIndices[current];

Will have to do some more brainstorming to figure out what additional info needs to be stored where to make this work.

The first obvious solution that pops out to me, though, is to simply return the following struct from insert and take it as an argument to remove:

struct UsedChunk
{
  const mepoo::ChunkHeader* chunkHeader;
  uint32_t current;
  uint32_t previous;
};
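With such a handle, an O(1) removal could look roughly like this (a hypothetical sketch; the member names mirror the existing m_listIndices/m_listData arrays, and returning the freed slot to the free list is elided):

mepoo::SharedChunk removeAt(const UsedChunk& handle)
{
    // unlink: the predecessor now points past the removed node
    m_listIndices[handle.previous] = m_listIndices[handle.current];
    // release the chunk stored in the freed slot
    return m_listData[handle.current].releaseToSharedChunk();
}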

@gpalmer-latai
Author

Note that this information basically gets immediately plumbed into the deleter of a unique_ptr here

@gpalmer-latai
Author

Ah, but the untyped subscriber doesn't so neatly encode a destructor: https://github.com/gpalmer-latai/iceoryx/blob/e97d209e13c36d71237c99e0246310b8029f8f26/iceoryx_posh/include/iceoryx_posh/internal/popo/untyped_subscriber_impl.inl#L54, so we'd have to store this information elsewhere...

@elfenpiff
Contributor

@gpalmer-latai Instead of using some kind of list, couldn't you fill the m_listData with optionals and set them to nullopt when the entry is no longer used?

@gpalmer-latai
Author

gpalmer-latai commented Mar 12, 2024

@gpalmer-latai Instead of using some kind of list, couldn't you fill the m_listData with optionals and set them to nullopt when the entry is no longer used?

The performance hit isn't in removing the elements - it is in locating where they are. Setting aside the quirks of the implementation (every operation has to fit within 64 bits to avoid torn writes), the UsedChunkList is more or less just a typical forward list. Removal is O(1) provided you already know the location of the data in the list. That is the part that is missing here.

@gpalmer-latai
Author

So following along my line of thought here - using

struct UsedChunk
{
  const mepoo::ChunkHeader* chunkHeader;
  uint32_t current;
  uint32_t previous;
};

in the UsedChunkList APIs would be sufficient to facilitate O(1) removal. It is similar to how std::list::insert yields an iterator.

The problem then becomes plumbing this through the rest of the stack. For the typed APIs it is relatively simple: you just bubble this data structure up to where the sample is created and plumb it into the deleter of the iox::unique_ptr: https://github.com/gpalmer-latai/iceoryx/blob/e97d209e13c36d71237c99e0246310b8029f8f26/iceoryx_posh/include/iceoryx_posh/internal/popo/subscriber_impl.inl#L49

The catch is the untyped APIs. They currently only return void* pointers and have separate release(void*)/publish(void*), etc. methods. I can think of a few options to handle these:

  1. Instead of returning void*, return something like the UsedChunk, but with the pointer being to the payload and the indices possibly being private with friend access to the publisher/subscriber classes. This is quite ugly, and users would have to adapt code to handle these ugly things. It might be less ugly if you add some operator->, but still...
  2. Instead of returning void*, return, directly or wrapped, an iox::unique_ptr<void*> with the same custom deleter logic as the typed APIs. Remove the release methods and change the untyped APIs to use the same RAII logic as the typed ones. Users then have to call sample.get() to get the void* pointer, which they can then use the same way as before. We could even add some convenience methods to make this easier, such as:
template <typename T>
T* UntypedSample::getTyped()
{
  return static_cast<T*>(m_ptr.get());
}
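A hypothetical call site for option 2 might then look like this (SensorData and the surrounding names are purely illustrative):

auto maybeSample = subscriber.take();
if (maybeSample.has_value())
{
    // same raw pointer as before, just reached through the wrapper
    auto* payload = maybeSample->getTyped<SensorData>();
    // ... use payload; the chunk is released automatically via RAII
    // when the sample wrapper goes out of scope
}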

@gpalmer-latai
Author

If it is preferable to you, I'm happy to talk through this problem synchronously via video chat. It is a bit of a high-priority issue for us, and I intend to implement some solution immediately.

@gpalmer-latai
Author

gpalmer-latai commented Mar 12, 2024

There is also a compromise solution: we could add the plumbing to allow typed endpoints to have fast removal by referencing the insertion index in their custom deleters, but leave the untyped endpoints alone and fall back to linear search.

I'm not a fan of this solution because of the general complexity of supporting two different paths, and also because it would require us to replace our usage of the untyped subscriber with typed ones. That won't be trivial, because we actually rely on the type-erased nature of samples received this way, extracting the size from the header. Using typed APIs instead would require some ugly hackery. I think it is doable, but still...

@gpalmer-latai
Author

gpalmer-latai commented Mar 12, 2024

Ah shoot, I just realized a flaw in my proposed solution here.

We cannot simply store the previous element's index in the returned data structure, because the previous element could itself be removed, invalidating that index. Instead, what we probably need to do is make this a "doubly-linked list" by adding an m_listIndicesBackwards.
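A rough sketch of the doubly-linked removal (m_listIndicesBackwards is the proposed new member, not existing code; head/tail edge cases are elided):

void removeAt(const uint32_t current)
{
    const uint32_t prev = m_listIndicesBackwards[current];
    const uint32_t next = m_listIndices[current];
    m_listIndices[prev] = next;          // unlink in the forward direction
    m_listIndicesBackwards[next] = prev; // unlink in the backward direction
    // a handle now only needs to store `current`, which cannot be
    // invalidated by the removal of neighboring elements
}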

@gpalmer-latai
Author

Hah, cool. iox::unique_ptr supports type erasure, unlike std::unique_ptr, because of its non-templated deleter. Neat. It seems like we really could return a Sample<void, H> for untyped endpoints.
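Roughly what makes this work (the constructor shape here is an assumption for illustration): the deleter is a runtime callable rather than a template parameter, so the deleter's type never leaks into the pointer's type.

// hypothetical release hook standing in for the real cleanup logic
void releasePayload(void* payload);

iox::unique_ptr<void> sample(rawPayload, [](void* p) { releasePayload(p); });
// with std::unique_ptr, a custom deleter becomes part of the type
// (std::unique_ptr<void, D>), so it cannot be type-erased the same way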

@gpalmer-latai
Author

FYI, I'm working through an implementation right now. I've gone a different route in refactoring the UsedChunkList and am leveraging the mpmc_loffli to make the UsedChunkList basically a slot map (since ordering doesn't matter). It seems like this might accidentally also make publishers and subscribers thread safe, but we'll see.

@gpalmer-latai
Author

gpalmer-latai commented Mar 16, 2024

Can confirm that using the mpmc_loffli works. Updating the used chunk list unit tests to use size 100,000 results in a hang on master, and completes in a snap on my experimental branch.

Working through the other API layers now.

@elBoberido
Member

Sorry for the late response. The mpmc_loffli approach should work but might add a higher cost than necessary due to memory synchronization. Have you thought about using the FixedPositionContainer? It's basically a hop slot map (I found the name only after implementing it). This is only a detail, though.
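The rough shape of that suggestion (method names assumed from the iceoryx container module, so treat this as a sketch rather than the exact API):

iox::FixedPositionContainer<ShmSafeUnmanagedChunk, CAPACITY> chunks;

// insertion yields an iterator recording the slot position
auto it = chunks.emplace(someChunk);

// ... later: O(1) erase via the stored iterator, no linear search
chunks.erase(it);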

As you noted, the more critical part is the API breakage for the untyped API, which is also used by the C binding. We have to be careful here, since we might break quite a lot of code out there.

@gpalmer-latai
Author

My intention is to try running Iceperf to see the impact of the synchronization. If it is too problematic, we can always fall back to a simpler slot map without the mpmc part.

@gpalmer-latai
Author

I think I might also be able to revert my changes that made the untyped APIs use the SmartChunk, in favor of returning a struct with the slot map token directly. It is ugly, but if that is what C requires 🤷

@elBoberido
Member

elBoberido commented Mar 19, 2024

There is another option: leave the current API as a legacy API and create a new one in the experimental folder. It is a bit more work, but it also gives us the option to experiment with the best approach, especially if you encounter more issues with your specific setup of 100k samples.

We could use this chance to rethink the untyped API a bit. Instead of using void* as the data type, we could reuse the typed API and just request a dynamic std::byte array with a user-defined alignment. After all, that's essentially the use case for the untyped API. We would have the additional benefit of type safety, and this could also be extended to types other than std::byte.

If you can read some Rust, it could look similar to this https://github.com/eclipse-iceoryx/iceoryx-rs/blob/master/examples/publisher_untyped.rs#L20-L30

@gpalmer-latai
Author

With the experimental API, I assume we still have to maintain the changes to the middleware layers (chunk_receiver, etc.) to support keeping track of the slot map iterator?

But then we can maintain the raw pointer and iterator paths separately and fall back to iteration when removing from the slot map with a pointer.

@gpalmer-latai
Author

gpalmer-latai commented Mar 20, 2024

By the way, I've got Iceperf building and running in my branch.

It looks like the average RTT latency went from 4.2 µs to 5.9 µs.

I'll try now with the FixedPositionContainer.

@gpalmer-latai
Author

With the FixedPositionContainer I get back down to 5.5 µs. I suspect most of the added latency therefore comes from the friction I've introduced in the upper API layers by passing around a struct instead of a pointer. Also, if Iceperf uses the untyped API, I suspect the logic of creating a smart chunk might contribute.

@gpalmer-latai
Author

Have you thought about using the FixedPositionContainer?

Question about this actually - does the container still satisfy this constraint?

In order to always be able to access the used chunks, neither a vector or list can be used, because these container could be corrupted when the application dies in the wrong moment.

I think that since there is no way to just "iterate over the data array", RouDi couldn't safely release all the shared chunks.

So I AM forced to use a separate free list - though instead of the mpmc_loffli I could just use an iox::vector of indices.
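A minimal sketch of that free list, assuming hypothetical member names inside the UsedChunkList:

// pre-filled with the indices 0 .. CAPACITY-1
iox::vector<uint32_t, CAPACITY> m_freeIndices;

uint32_t allocateSlot()
{
    // O(1): pop a free slot index
    const uint32_t slot = m_freeIndices.back();
    m_freeIndices.pop_back();
    return slot;
}

void freeSlot(const uint32_t slot)
{
    // O(1): hand the slot back; RouDi can still walk the whole
    // m_listData array linearly during crash cleanup
    m_freeIndices.push_back(slot);
}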

@gpalmer-latai
Author

It's too bad, because the iteration properties of the FixedPositionContainer would make it a suitable choice for backwards compatibility - allowing the existing untyped API to continue releasing with the raw pointer by just iterating over the container to match against that pointer.

@elBoberido
Member

I need to look closer at the FixedPositionContainer but why do you think it is not suitable?

Btw, those benchmark numbers are quite high. Did you test in release mode?

@gpalmer-latai
Author

I need to look closer at the FixedPositionContainer but why do you think it is not suitable?

Because of this invariant the UsedChunkList needs to uphold:

In order to always be able to access the used chunks, neither a vector or list can be used, because these container could be corrupted when the application dies in the wrong moment.

In order for RouDi to clean up the shared chunks, it needs to iterate over the actual data array. The FixedPositionContainer does not expose this directly (though I suppose you could always add some friend accessor for RouDi), and the iterators could be left in an undefined state if the subscriber crashed at the wrong moment.

@gpalmer-latai
Author

Btw, those benchmark numbers are quite high. Did you test in release mode?

I don't know. I'm not sure it matters too much for comparison's sake, though I could go back and fiddle around with it some more. The point is that my changes altered the Iceperf average under the default build configuration from 4.2 microseconds RTT to 5.9 for the mpmc_loffli implementation with altered APIs, 5.6 for the FixedPositionContainer implementation with altered APIs, and 4.6 when ONLY swapping out the forward list for the FixedPositionContainer but leaving the APIs unchanged.

@gpalmer-latai
Author

Right now I'm doing another pass using a simple iox::vector as the free list, with some lighter API changes. I'll see what the performance is when I'm done. For our immediate purposes the increase in latency isn't really a major concern, but we do want to get this to a state where we can at least propose a reasonable experimental implementation upstream.

@elBoberido
Member

My gut feeling is that it will be worse for large amounts of samples, especially when the ones from the beginning of the container are removed.

@gpalmer-latai
Author

gpalmer-latai commented Mar 25, 2024

It will be worse if you have to perform a linear search, yes. That would be the case if, for example, we wished to maintain the legacy API, since over time you may end up iterating over large swaths of tombstone values (unless we can adapt the FixedPositionContainer to allow bypassing iteration for cleanup, in which case we have the best of both worlds: a slot map with O(1) removal and efficient iteration).

However, for the typed API and the experimental new untyped subscriber, removal will always be O(1), because you directly pass the slot handle, with the index allocated from the free list, as an argument to remove.

@gpalmer-latai
Author

When you call FixedPositionContainer::clear, by the way, it does circumvent the possibly-corrupted iterators and reset everything. But it does so by calling the destructor on all the data, which is insufficient for the UsedChunkList: rather than calling the destructor on the elements, it needs to call releaseToSharedChunk() instead. Perhaps there is a workaround, either by adding a destructor to ShmSafeUnmanagedChunk or by allowing a custom callback on the clear method which defaults to just calling the destructor on the element.

@elBoberido
Member

Not sure what you mean by iterating over tombstone values? The FixedPositionContainer skips removed elements during iteration.

The FixedPositionContainer would need some adaptations, but I think they would not be too intrusive. There are basically two options: either adding a custom callback to the clear method, or having a drain method with a custom callback.
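The first option could look roughly like this (m_status and m_data are assumed internals; the real container may differ):

template <typename Cleanup>
void clear(const Cleanup& cleanup)
{
    // walk the raw slot array via status flags instead of the
    // possibly-corrupted next/prev links
    for (uint64_t i = 0U; i < CAPACITY; ++i)
    {
        if (m_status[i] == SlotStatus::USED)
        {
            // e.g. releaseToSharedChunk() instead of the destructor
            cleanup(m_data[i]);
        }
        // re-initialize the slot's bookkeeping (elided)
    }
}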

@gpalmer-latai
Author

Not sure what you mean by iterating over tombstone values? The FixedPositionContainer skips removed elements during iteration.

I meant implementations using iox::vector or MpmcLoffli.

If we adapt the FixedPositionContainer, though, we could support both use cases well.

@gpalmer-latai
Author

So just as a quick update - switching from MpmcLoffli to iox::vector works fairly well. I haven't had a chance to run Iceperf yet.

I have incorporated a much more polished version of master...gpalmer-latai:iceoryx:iox-2221-constant-time-chunk-release into our fork of Iceoryx - one which backtracks the changes to the untyped APIs to use smart wrappers and instead returns and takes the slot map handle (renamed from UsedChunk to UsedChunkHandle) directly.

Unfortunately I will have to context-switch to another task for the time being and don't have an upstream PR / design proposal to share as of yet. Once I am able to free up some time, though, I would propose something along these lines (a sketch of the handle from step 3 follows the list):

  1. Replace the forward list in the UsedChunkList with the FixedPositionContainer.
  2. Adapt the FixedPositionContainer to allow custom cleanup of the elements, such that RouDi would be able to release a bunch of ShmSafeUnmanagedChunks contained by one.
  3. Create a UsedChunkHandle which acts as a sort of slot map handle to the UsedChunkList (which could arguably be renamed to UsedChunkSlotMap). It could have conveniences like a * and -> operator to the underlying chunk header.
  4. Propagate this handle up and down all inner layers of the stack. For allocating the chunk this would replace existing calls returning a naked pointer. For releasing the chunk it would be an additional overload.
  5. Adapt the SmartChunk class to work with this handle, piping it through to release callbacks upon destruction and releasing it explicitly when release() is called.
  6. Update the typed endpoints as necessary to create the SmartChunk with these changes.
  7. Leave the untyped endpoints alone for now - call chunkHandle->userPayload() when handing the allocated pointer to the user, and continue using the existing release/publish paths that take a pointer.
  8. When releasing a chunk from the UsedChunkList, there will be both an O(1) overload that takes the handle and removes the corresponding element, and an O(n) overload that takes a pointer and does a linear search for the corresponding element. The latter will still be used for the untyped endpoints and C APIs.
  9. Create a new set of untyped endpoints under the experimental folder returning variants of the SmartChunk which expose dynamic spans instead of typed data. These will also use the UsedChunkHandle under the hood and will therefore support efficient release of the chunks.
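A hypothetical shape for the handle from step 3 (all names illustrative):

class UsedChunkHandle
{
  public:
    // convenience access to the underlying chunk header
    const mepoo::ChunkHeader* operator->() const { return m_chunkHeader; }
    const mepoo::ChunkHeader& operator*() const { return *m_chunkHeader; }

  private:
    friend class UsedChunkList; // only the slot map hands these out
    const mepoo::ChunkHeader* m_chunkHeader{nullptr};
    uint32_t m_slotIndex{0U}; // position in the UsedChunkList slot map
};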

@elBoberido
Member

@gpalmer-latai that sounds reasonable. Go ahead.
