
KAFKA-15859: Make RemoteListOffsets call an async operation #16602

Merged: 7 commits merged into apache:trunk on Sep 15, 2024

Conversation


@kamalcph kamalcph commented Jul 16, 2024

This is part 2 of KIP-1075.

Clients use the ListOffsets API to find the offset for a given timestamp. When remote storage is enabled for a topic, we have to fetch remote indexes such as the offset index and the time index to serve the query. A single ListOffsets request can also query multiple topics/partitions.
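
For reference, here is a minimal client-side sketch of such a query using the Admin client's listOffsets API (the topic name, timestamp, and bootstrap address are illustrative):

    import java.util.Map;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.ListOffsetsResult;
    import org.apache.kafka.clients.admin.OffsetSpec;
    import org.apache.kafka.common.TopicPartition;

    // Ask the broker for the earliest offset whose record timestamp is >= the
    // given timestamp; on a tiered topic this may require remote index reads.
    public class ListOffsetsExample {
        public static void main(String[] args) throws Exception {
            try (Admin admin = Admin.create(Map.<String, Object>of("bootstrap.servers", "localhost:9092"))) {
                TopicPartition tp = new TopicPartition("tiered-topic", 0);
                ListOffsetsResult result =
                        admin.listOffsets(Map.of(tp, OffsetSpec.forTimestamp(1_700_000_000_000L)));
                System.out.println(result.partitionResult(tp).get().offset());
            }
        }
    }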

The time taken to read the indexes from remote storage is non-deterministic, and the query is handled by the request-handler threads. If there are multiple LIST_OFFSETS queries and most of the request-handler threads are busy reading data from remote storage, other high-priority requests such as FETCH and PRODUCE might starve and be queued. This can lead to higher latency in producing/consuming messages.

In this patch, we introduce a delayed operation for the remote list-offsets call. If the timestamp needs to be searched in remote storage, the request-handler threads pass the request on to the remote-log-reader threads, and the request is handled asynchronously.
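
As a rough sketch of the hand-off (the pool, class, and method names below are illustrative, not the actual patch):

    import java.util.Optional;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Illustrative only: the request-handler thread schedules the remote read and
    // returns immediately; the blocking index lookups run on a remote-log-reader
    // pool, which completes the future (and, in the real patch, pokes the purgatory).
    public class RemoteListOffsetsHandoff {
        private final ExecutorService remoteLogReaderPool = Executors.newFixedThreadPool(4);

        public CompletableFuture<Optional<Long>> offsetForTimestamp(long timestamp) {
            CompletableFuture<Optional<Long>> result = new CompletableFuture<>();
            remoteLogReaderPool.submit(() -> {
                try {
                    result.complete(lookupRemoteIndexes(timestamp)); // slow remote read
                } catch (Exception e) {
                    result.completeExceptionally(e);
                }
            });
            return result; // request-handler thread is free again
        }

        // Stand-in for fetching the remote offset/time indexes.
        private Optional<Long> lookupRemoteIndexes(long timestamp) {
            return Optional.empty();
        }
    }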

Covered the patch with unit and integration tests.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@kamalcph kamalcph added the "tiered-storage" (Related to the Tiered Storage feature) label on Jul 16, 2024
@kamalcph (Contributor Author)

@chia7712 @showuon @satishd

Call for review. PTAL.

@kamalcph (Contributor Author)

Test failures are unrelated.

@kamalcph kamalcph force-pushed the KAFKA-15859a branch 4 times, most recently from 516edad to 3e627ac on July 23, 2024
@kamalcph kamalcph force-pushed the KAFKA-15859a branch 3 times, most recently from a448c3e to 9464b90 on August 2, 2024
@showuon (Contributor) commented Aug 2, 2024

@kamalcph, I'd like to make sure this PR can be reviewed before KIP-1075 gets approved. Is that right?

@kamalcph (Contributor Author) commented Aug 2, 2024

@kamalcph, I'd like to make sure this PR can be reviewed before KIP-1075 gets approved. Is that right?

Yes, this PR can be reviewed. There are no public API changes in this PR. To define the timeout for the delayed remote list-offsets operation, I reused the server request timeout, since tiered storage is not production ready. If that is not acceptable, we may have to wait for KIP-1075 approval.

@showuon (Contributor) commented Aug 2, 2024

Well, I'll review KIP-1075 first then.

@kamalcph (Contributor Author) commented Aug 3, 2024

@showuon @clolov

Is it possible to give this PR an early review to keep it in good shape? Thanks!

@showuon showuon (Contributor) left a comment

Had a quick pass; I now understand more about what we're trying to achieve here. High-level comment: I saw that some newly added files are in Scala; could we rewrite them in Java? You don't have to do that now, please wait until the KIP is accepted. Let's discuss the details in the KIP discussion. Thanks.

@chia7712 (Contributor) commented Sep 3, 2024

@kamalcph please fix conflicts, thanks :)

@kamalcph kamalcph force-pushed the KAFKA-15859a branch 2 times, most recently from 2608e65 to e36054c Compare September 3, 2024 18:25
@kamalcph (Contributor Author) commented Sep 3, 2024

@showuon @chia7712 @satishd @clolov

The diff is ready for review. PTAL. Thanks!

@chia7712 chia7712 (Contributor) left a comment

@kamalcph thanks for this patch and sorry for the late review. A couple of comments are left. PTAL.

delayedRemoteListOffsetsPurgatory.checkAndComplete(key);
})
);
return new AsyncOffsetReadFutureHolder<>(jobFuture, taskFuture);
Contributor

Pardon me, why do we need two futures here? Is CompletableFuture.supplyAsync unsuitable for this case? For example:

        CompletableFuture<Optional<FileRecords.TimestampAndOffset>> taskFuture = CompletableFuture.supplyAsync(() -> {
            try {
                // If it is not found in remote storage, then search in the local storage starting with local log start offset.
                Optional<FileRecords.TimestampAndOffset> rval = findOffsetByTimestamp(topicPartition, timestamp, startingOffset, leaderEpochCache);
                if (rval.isPresent()) return rval;
                return OptionConverters.toJava(searchLocalLog.get());
            } catch (Exception e) {
                // NOTE: All the exceptions from the secondary storage are caught instead of only the KafkaException.
                LOGGER.error("Error occurred while reading the remote log offset for {}", topicPartition, e);
                throw new RuntimeException(e);
            }
        }, remoteStorageReaderThreadPool);

@kamalcph kamalcph (Contributor Author) commented Sep 13, 2024

Thanks for the review!

The reason for maintaining two futures, jobFuture and taskFuture: they are required to trigger the delayed-operation completion (delayedRemoteListOffsetsPurgatory#checkAndComplete(key)) on the same remote-log-reader thread after the RemoteLogManager#findOffsetByTimestamp plus searchLocalLog operation completes.

In the DelayedRemoteListOffsets purgatory, we return the result once all the partitions' results are received; then the delayed operation gets completed.

We have the ActionQueue to complete pending actions, but the LIST_OFFSETS request can be served by any replica (the least-loaded node). If the node serving the request doesn't hold leadership for any of the partitions, the result might not be complete.
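
A self-contained sketch of the ordering this two-future design guarantees, with simplified stand-ins for jobFuture, taskFuture, and checkAndComplete:

    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // The reader thread completes taskFuture with the result FIRST, and only then
    // pokes the purgatory, so tryComplete() observes a completed future.
    public class TwoFuturesSketch {
        public static void main(String[] args) {
            ExecutorService readerPool = Executors.newSingleThreadExecutor();
            CompletableFuture<String> taskFuture = new CompletableFuture<>();
            Future<?> jobFuture = readerPool.submit(() -> {
                String result = "timestamp-and-offset";   // stand-in for the remote read
                taskFuture.complete(result);              // 1. result becomes visible
                checkAndComplete(taskFuture);             // 2. purgatory check now succeeds
            });
            readerPool.shutdown();
        }

        // Stand-in for delayedRemoteListOffsetsPurgatory.checkAndComplete(key).
        static void checkAndComplete(CompletableFuture<String> taskFuture) {
            System.out.println("tryComplete sees isDone = " + taskFuture.isDone()); // true
        }
    }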

Contributor

Thanks for the explanation, but "trigger the delayed operation completion in the same thread" seems to work with CompletableFuture.supplyAsync too, right?

        CompletableFuture<Either<Exception, Option<FileRecords.TimestampAndOffset>>> taskFuture = CompletableFuture.supplyAsync(() -> {
            Either<Exception, Option<FileRecords.TimestampAndOffset>> result;
            try {
                // If it is not found in remote storage, then search in the local storage starting with local log start offset.
                Option<FileRecords.TimestampAndOffset> timestampAndOffsetOpt =
                        OptionConverters.toScala(findOffsetByTimestamp(topicPartition, timestamp, startingOffset, leaderEpochCache))
                                .orElse(searchLocalLog::get);
                result = Right.apply(timestampAndOffsetOpt);
            } catch (Exception e) {
                // NOTE: All the exceptions from the secondary storage are caught instead of only the KafkaException.
                LOGGER.error("Error occurred while reading the remote log offset for {}", topicPartition, e);
                result = Left.apply(e);
            } finally {
                TopicPartitionOperationKey key = new TopicPartitionOperationKey(topicPartition.topic(), topicPartition.partition());
                delayedRemoteListOffsetsPurgatory.checkAndComplete(key);
            }
            return result;
        }, remoteStorageReaderThreadPool);

I notice there is a similar pattern in DelayedRemoteFetch, so it is OK to keep the current design for consistency. However, it would be great to let me know (for my own education) what side effect happens if we use CompletableFuture.supplyAsync :)

Contributor Author

This won't work. When delayedRemoteListOffsetsPurgatory.checkAndComplete(key) is invoked, it calls DelayedRemoteListOffsets#tryComplete, which checks whether the taskFuture has completed. In the supplyAsync version, checkAndComplete runs inside the supplier before the returned future is done, so the taskFuture is not yet completed and the delayed operation won't complete.

Contributor

It is not completed so the delayedOperation won't complete.

oh, you are totally right, thanks!!!

fetchOnlyFromLeader)

val status = resultHolder match {
case OffsetResultHolder(Some(found), _) =>
Contributor

The data structures get complicated in this path. If these new structures serve the "remote" case only, could you please consider defining a subclass of TimestampAndOffset that holds the data used only by the remote path?

Or please add comments for those cases at least?

Contributor Author

This PR is large. I'll address the refactoring in the next PR. Is that fine?

please add comments for those cases at least?

Added comments. Let me know if they need to be improved.

.find(_.maxTimestamp() == maxTimestampSoFar.timestamp)
.flatMap(batch => batch.offsetOfMaxTimestamp().asScala.map(new TimestampAndOffset(batch.maxTimestamp(), _,
Optional.of[Integer](batch.partitionLeaderEpoch()).filter(_ >= 0))))
OffsetResultHolder(timestampAndOffsetOpt)
Contributor

Pardon me, why does MAX_TIMESTAMP not consider the records in remote storage?

@kamalcph kamalcph (Contributor Author) commented Sep 13, 2024

I went over KIP-734; the purpose of MAX_TIMESTAMP is to get the offset of the record with the highest timestamp in the partition:

Used to retrieve the offset with the largest timestamp of a partition as message timestamps can be specified client side this may not match the log end offset returned by LatestSpec

With remote storage enabled, all the passive segments might be uploaded to remote storage and removed from the local log; the local log might then contain only one empty active segment. We have to handle the MAX_TIMESTAMP case for remote storage. Thanks for catching this! Filed KAFKA-17552 to track this issue separately.

@kamalcph kamalcph (Contributor Author) commented Sep 14, 2024

I went over #15621 to see how we handle the MAX_TIMESTAMP case for normal topics.

Now, we maintain shallowOffsetOfMaxTimestampSoFar in LogSegment instead of the real max-timestamp offset. While uploading a LogSegment to remote storage, we create the RemoteLogSegmentMetadata event, which holds the metadata about the segment.

Even if we pass shallowOffsetOfMaxTimestampSoFar in the RemoteLogSegmentMetadata event, we have to download the remote log segment to find the real offsetOfMaxTimestampSoFar. This will increase the load on remote storage, given that the original intention of KIP-734 was confirming topic/partition "liveness", which means the Admin client will repeatedly invoke the list-offsets API for MAX_TIMESTAMP.

The predominant way to confirm topic/partition "liveness" is to query the local log, which works as expected. For MAX_TIMESTAMP on a topic enabled with remote storage, the results can go wrong when:

  1. If the timestamps are monotonic and all the passive segments have been uploaded to remote storage and deleted from local disk, the only active segment on local disk is empty (after the log roll), so the max timestamp will be returned as "-1".
  2. If the timestamps are non-monotonic, the (timestamp, offset) returned by the API may not be the true (max-timestamp, max-timestamp-offset), as only the local-log segments are considered.

Should we handle or drop the MAX_TIMESTAMP case for topics enabled with remote storage? Handling it can cause high load:

  1. On the RemoteLogMetadataManager, as we have to scan all the uploaded-segment events to find the max timestamp, then compare it with the max timestamp computed from the local-log segments. The search should always proceed from remote to local storage.
  2. On the RemoteStorageManager, when the MAX_TIMESTAMP record exists in remote storage, we have to repeatedly download that segment (a few bytes) to serve the query.

In KIP-734, can we add an addendum to say that MAX_TIMESTAMP is not supported for topics enabled with remote storage? Note that when the KIP was proposed, the intention was not to read from disk:

Snippet from KIP-734:

LogSegments track the highest timestamp and associated offset so we don't have to go to disk to fetch this

cc @satishd @showuon

Contributor

Even, if we pass the shallowOffsetOfMaxTimestampSoFar in the RemoteLogSegmentMetadata event, we have to download the remote-log-segment to find the real offsetOfMaxTimestampSoFar.

IMHO, we don't need to pass shallowOffsetOfMaxTimestampSoFar to RemoteLogSegmentMetadata, as RemoteLogSegmentMetadata already has maxTimestampMs, so the basic behavior is listed below (same as you described, I'd say; see the sketch after the list):

  1. find the max timestamp from local segments
  2. query findOffsetByTimestamp by max timestamp from step_1
  3. compare the timestamp of record from remote to local to pick up correct offset
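
A minimal sketch of those three steps, with all data faked as plain longs (step 3's offset resolution is only indicated, not implemented):

    import java.util.OptionalLong;
    import java.util.stream.LongStream;

    // Illustrative outline of the remote-aware MAX_TIMESTAMP lookup sketched above.
    public class MaxTimestampSketch {
        public static void main(String[] args) {
            long localMax = 1_700_000_200_000L;              // step 1: max timestamp in local segments
            OptionalLong remoteMax = LongStream.of(          // step 2: maxTimestampMs from each
                    1_700_000_100_000L, 1_700_000_300_000L)  //         remote segment metadata event
                    .max();
            // step 3: resolve the offset from whichever tier holds the larger timestamp
            System.out.println(remoteMax.orElse(-1L) > localMax
                    ? "resolve offset from the remote tier"
                    : "resolve offset from the local log");
        }
    }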

This will increase the load on remote storage assuming that the original intention of the KIP-734 is to Confirming topic/partition "liveness" which means the Admin client will repeatedly invoke the list-offsets API for MAX_TIMESTAMP.

The implementation of KIP-734 was wrong because we don't loop over all records in all paths (because of the cost). Hence, we renamed offsetOfMaxTimestampSoFar to shallowOffsetOfMaxTimestampSoFar to reflect the actual behavior.

Should we handle/drop the MAX_TIMESTAMP case for topics enabled with remote storage? This can cause high load:

That is an acceptable approach. We can REJECT the MAX_TIMESTAMP request for now, as it is a rare operation. Or we can make that call an async op too, since it needs to iterate over all the metadata of the remote segments.

override def tryComplete(): Boolean = {
var completable = true
metadata.statusByPartition.forKeyValue { (partition, status) =>
if (!status.completed) {
Contributor

Could we check the status of the futures instead? For example:

  def completable = status.futureHolderOpt.isEmpty || status.futureHolderOpt.get.jobFuture.isDone

@kamalcph kamalcph (Contributor Author) commented Sep 13, 2024

As mentioned in the other comment, we need the status.completed variable as it is volatile and accessed by multiple threads for inter-thread visibility.

Contributor Author

def completable = status.futureHolderOpt.isEmpty || status.futureHolderOpt.get.jobFuture.isDone

This method seems to be thread-safe. We can use this method instead of the completed variable, but I find that using the variable makes the code clearer. I'm open to changing this; will take this refactoring in the next PR.

metadata.statusByPartition.forKeyValue { (topicPartition, status) =>
status.completed = status.futureHolderOpt.isEmpty
if (status.futureHolderOpt.isDefined) {
status.responseOpt = Some(buildErrorResponse(Errors.REQUEST_TIMED_OUT, topicPartition.partition()))
Contributor

Pardon me, why set responseOpt early? If we keep responseOpt as None, we can reuse responseOpt to evaluate "completed".

@kamalcph kamalcph (Contributor Author) commented Sep 13, 2024

Yes, that's true. Assume one LIST_OFFSETS request queries the offsetForTimestamp of 10 partitions; those 10 partitions are handled concurrently, provided remote-log-reader threads are available. When any one thread completes, it marks that partition's status as completed and checks the statuses of all the other partitions.

The completed variable is accessed by multiple remote-log-reader threads, so I marked it volatile and used it for the computation instead of responseOpt.
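
A minimal sketch of that visibility pattern (the status class and the all-done check are stand-ins for the Scala ListOffsetsPartitionStatus and tryComplete):

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Each remote-log-reader thread flips its partition's volatile flag; any
    // thread can then safely evaluate whether all partitions are done.
    public class CompletedFlagSketch {
        static class PartitionStatus {
            volatile boolean completed; // volatile: writes are visible across threads
        }

        public static void main(String[] args) {
            List<PartitionStatus> statuses = List.of(new PartitionStatus(), new PartitionStatus());
            ExecutorService readers = Executors.newFixedThreadPool(2);
            for (PartitionStatus status : statuses) {
                readers.submit(() -> {
                    status.completed = true; // this partition's result arrived
                    if (statuses.stream().allMatch(s -> s.completed)) {
                        System.out.println("delayed operation can complete");
                    }
                });
            }
            readers.shutdown();
        }
    }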

@chia7712 chia7712 (Contributor) left a comment

LGTM

@satishd satishd (Member) left a comment

Thanks @kamalcph for addressing the review comments. LGTM.

@kamalcph (Contributor Author)

JDK11 test failures are unrelated to this PR; the tests timed out.

Found 2 flaky test failures:
FLAKY ⚠️  MetricsDuringTopicCreationDeletionTest > "testMetricsDuringTopicCreateDelete(String).quorum=zk"
FLAKY ⚠️  OffsetsApiIntegrationTest > testAlterSinkConnectorOffsetsDifferentKafkaClusterTargeted()
Read env GRADLE_EXIT_CODE: 124
Read env THREAD_DUMP_URL: https://github.com/apache/kafka/actions/runs/10854248301/artifacts/1932678337
Gradle command timed out. These are partial results!
25602 tests cases run in 6h14m4s. 23379 PASSED ✅, 0 FAILED ❌, 2 FLAKY ⚠️ , 18 SKIPPED 🙈, and 0 errors.
Failing this step because the tests timed out. Thread dumps were taken and archived here: https://github.com/apache/kafka/actions/runs/10854248301/artifacts/1932678337
Error: Process completed with exit code 1.

- Started the key-value pairs from 0 to match the offset numbers: (k0, v0) matches offset 0, which improves test readability.
@chia7712 (Contributor)

@kamalcph thanks for checking the failed tests. They pass on my local machine. Will merge this PR.

./gradlew cleanTest :tools:test --tests MetadataQuorumCommandTest.testDescribeQuorumReplicationSuccessful :core:test --tests ListOffsetsIntegrationTest.testThreeNonCompressedRecordsInSeparateBatch --tests ConsumerBounceTest.testConsumptionWithBrokerFailures

@chia7712 (Contributor)

@kamalcph any updates? Or should we just trigger QA again?

@kamalcph (Contributor Author) commented Sep 15, 2024

I rebased the branch against trunk to retrigger the tests again.

@chia7712 (Contributor)

I rebased the branch against trunk to retrigger the tests again.

Got it

@chia7712 chia7712 merged commit 344d8a6 into apache:trunk Sep 15, 2024
7 of 9 checks passed
@kamalcph kamalcph deleted the KAFKA-15859a branch September 16, 2024 03:11
@kamalcph (Contributor Author)

Thank you all for the reviews!

@mumrah (Contributor) commented Sep 16, 2024

@chia7712 (Contributor)

@mumrah Sorry for that flaky test. I will take a look!

@kamalcph (Contributor Author)

Opened #17214 to fix the flaky test. PTAL.

@junrao junrao (Contributor) left a comment

@kamalcph : Thanks for the PR. Sorry for the late review. Added a few more comments.

@@ -1263,7 +1263,7 @@ class UnifiedLog(@volatile var logStartOffset: Long,
* None if no such message is found.
*/
@nowarn("cat=deprecation")
-  def fetchOffsetByTimestamp(targetTimestamp: Long, remoteLogManager: Option[RemoteLogManager] = None): Option[TimestampAndOffset] = {
+  def fetchOffsetByTimestamp(targetTimestamp: Long, remoteLogManager: Option[RemoteLogManager] = None): OffsetResultHolder = {
Contributor

Could we change the description of the return value accordingly?

@kamalcph kamalcph (Contributor Author) commented Oct 12, 2024

Addressed this in #17487.

@@ -263,6 +270,10 @@ public RemoteLogManager(RemoteLogManagerConfig rlmConfig,
);
}

public void setDelayedOperationPurgatory(DelayedOperationPurgatory<DelayedRemoteListOffsets> delayedRemoteListOffsetsPurgatory) {
this.delayedRemoteListOffsetsPurgatory = delayedRemoteListOffsetsPurgatory;
Contributor

delayedRemoteListOffsetsPurgatory is written and read by different threads. Does it need to be volatile?

Contributor Author

The other purgatories are also accessed by multiple threads, but they don't have volatile, so I followed the same approach. Let me know whether it is required.

Contributor

Are you talking about the purgatories in ReplicaManager? Those are set during the creation of ReplicaManager. Here, delayedRemoteListOffsetsPurgatory is not set during the creation of RemoteLogManager.

Contributor Author

Added volatile to the delayedRemoteListOffsetsPurgatory in RemoteLogManager. My understanding was that we instantiate the dataPlaneRequestProcessor after calling ReplicaManager#startup, so there wouldn't be an issue, but it is good to be on the safe side.
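
For illustration, the shape of that fix, with a simplified field and setter rather than the actual RemoteLogManager code:

    // The purgatory is injected after construction and later read by
    // remote-log-reader threads; without volatile, a reader could observe a
    // stale null reference.
    public class RemoteLogManagerSketch {
        private volatile Object delayedRemoteListOffsetsPurgatory;

        public void setDelayedOperationPurgatory(Object purgatory) {
            this.delayedRemoteListOffsetsPurgatory = purgatory; // safe publication
        }
    }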

}
}

class DelayedRemoteListOffsets(delayMs: Long,
Contributor

ReplicaManager calls completeDelayedFetchOrProduceRequests when a replica is removed from the broker or becomes a follower, to wake up pending produce/fetch requests early. Should we do the same for pending remote list-offsets requests?

Contributor Author

Addressed this in #17487.

Retained the same method name. Shall we change the method name to completeDelayedFetchOrProduceOrRemoteListOffsetsRequests?

ReplicaManager.completeDelayedFetchOrProduceRequests
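
A hypothetical sketch of that early wake-up, modeling the purgatory as a map of pending actions (all names and shapes here are illustrative):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // When this broker stops leading a partition, complete any delayed
    // remote list-offsets work watching it instead of letting it time out.
    public class LeadershipWakeupSketch {
        private final Map<String, Runnable> pendingByPartition = new ConcurrentHashMap<>();

        void onLeadershipLost(String topic, int partition) {
            String key = topic + "-" + partition; // stand-in for TopicPartitionOperationKey
            Runnable pending = pendingByPartition.remove(key);
            if (pending != null) {
                pending.run(); // e.g. answer with NOT_LEADER_OR_FOLLOWER early
            }
        }
    }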

Contributor

Perhaps naming it completeDelayedOperationsWhenNotPartitionLeader ?

Contributor Author

Renamed the method.

import scala.collection.{Map, mutable}
import scala.jdk.CollectionConverters._

case class ListOffsetsPartitionStatus(var responseOpt: Option[ListOffsetsPartitionResponse] = None,
Contributor

responseOpt can be written and read by different threads. Should it be volatile?

Contributor Author

Addressed this in #17487.

// create a list of (topic, partition) pairs to use as keys for this delayed remote list offsets operation
val listOffsetsRequestKeys = statusByPartition.keys.map(TopicPartitionOperationKey(_)).toSeq
// try to complete the request immediately, otherwise put it into the purgatory
delayedRemoteListOffsetsPurgatory.tryCompleteElseWatch(delayedRemoteListOffsets, listOffsetsRequestKeys)
Contributor

listOffsetsRequestKeys is a bit weird. It is based on a topicPartition, and the purgatory is checked on that key every time a remote listOffset task completes. However, the completion of such a task has no impact on other pending listOffset requests on the same partition.

The only reason we really need the purgatory is the expiration logic after the timeout, if we chain all the futures together. Perhaps using the pattern of DelayedFuturePurgatory is more intuitive?

Contributor Author

Went through DelayedFuturePurgatory and understood the changes required. This is a big change, so we can take it up separately:

  1. DelayedFuturePurgatory does not emit any request-expiration metrics.
  2. When a replica for a partition moves away, we cannot complete the request for that partition, as the watch key is unique for each request.

@junrao junrao (Contributor) commented Oct 14, 2024

  1. Where is the request-expiration metric emitted now?
  2. Good point. We could have a customized DelayedFuturePurgatory that also adds a delayed-operation key per partition, which is then triggered for a completion check when the replica is no longer the leader.

Contributor Author

  1. The expiration metrics are emitted by the individual purgatories, e.g. DelayedRemoteListOffsetsMetrics#recordExpiration.
  2. Agreed, this will improve performance. Will take the custom DelayedFuturePurgatory changes up separately, as it is a big change.

Contributor Author

Filed KAFKA-17797 to track this change.

}
}

case class ListOffsetsMetadata(statusByPartition: mutable.Map[TopicPartition, ListOffsetsPartitionStatus]) {
Contributor

Hmm, why do we need this wrapper class? Could we just use the Map directly?

Contributor Author

Addressed this in #17487.

@kamalcph (Contributor Author)

Thanks @junrao for the review comments! I will follow up on them.

kamalcph added a commit to kamalcph/kafka that referenced this pull request Oct 12, 2024
…ca is removed from broker.

- Removed the ListOffsetsMetadata wrapper class.
- Addressed review comments from PR apache#16602