Conversation
Force-pushed from 9ed3283 to fec6001
Force-pushed from fec6001 to d09affd
1. Support aggregated records. 2. Avoid expired shards blocking the stream.
Force-pushed from d09affd to a517e26
assertTrue(checkpointAfterMerge.startsWith(streamName + ","));
int initialShardCount = KinesisOffsetGen.CheckpointUtils.strToOffsets(checkpointAfterBatch1).size();
int shardCountAfterMerge = KinesisOffsetGen.CheckpointUtils.strToOffsets(checkpointAfterMerge).size();
assertTrue(shardCountAfterMerge > initialShardCount,
After the merge, new child shards are generated, but the parent shards are still present and not yet expired. So the assertion should use ">", not "<".
Force-pushed from 01a34fe to d43f8eb
Codecov Report
❌ Patch coverage is

@@             Coverage Diff              @@
##             master   #18224      +/-  ##
============================================
- Coverage     61.43%   57.03%    -4.41%
+ Complexity    23082    18520     -4562
============================================
  Files          2108     1949      -159
  Lines        127636   106587    -21049
  Branches      14534    13196     -1338
============================================
- Hits          78409    60787    -17622
+ Misses        42873    40075     -2798
+ Partials       6354     5725      -629
        record.approximateArrivalTimestamp().toEpochMilli());
  }
  return OBJECT_MAPPER.writeValueAsString(node);
} catch (Exception e) {
This catch (Exception e) silently drops the error and returns the raw string without any logging. If offset appending fails (e.g., data isn't valid JSON, or it's a JSON array rather than an object), you'd have no way to tell why some records have offsets and others don't. Could you at least log a warning here so data quality issues are debuggable?
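A hedged sketch of the fix asked for in this comment, using java.util.logging in place of Hudi's logger; the class and method names are illustrative stand-ins, not the PR's actual code:

```java
import java.util.logging.Logger;

public class OffsetAppendSketch {
  private static final Logger LOG = Logger.getLogger(OffsetAppendSketch.class.getName());

  // Hypothetical stand-in for the offset-appending logic: on any failure we log a
  // warning with enough context to debug, then fall back to the raw string as before.
  public static String appendOffsets(String rawJson, String shardId, String seqNo) {
    try {
      if (rawJson == null || !rawJson.trim().startsWith("{")) {
        // Mirrors the failure modes named in the review: not valid JSON, or a JSON array.
        throw new IllegalArgumentException("record is not a JSON object");
      }
      // ... real code would parse the object and add the offset fields here ...
      return rawJson;
    } catch (Exception e) {
      LOG.warning("Failed to append Kinesis offsets for shard " + shardId
          + " at seq " + seqNo + "; returning raw record: " + e.getMessage());
      return rawJson;
    }
  }
}
```

The behavior is unchanged (raw string is still returned), but the warning makes partial-offset batches debuggable.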
});

// Cache so we can both get records and checkpoint from the same RDD
fetchRdd.persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK());
If toBatch is called a second time (e.g., during retry), the previous persistedFetchRdd is overwritten without being unpersisted first, leaking Spark storage memory. Could you add if (persistedFetchRdd != null) { persistedFetchRdd.unpersist(); } before the persist call, or is it handled somewhere else?
Could we avoid persisting the RDDs? This can degrade the performance if spilling happens.
hudi-utilities/src/main/java/org/apache/hudi/utilities/config/KinesisSourceConfig.java (resolved review comments)
while (allRecords.size() < maxTotalRecords && shardIterator != null) {
  GetRecordsResponse response;
  try {
    response = client.getRecords(
        GetRecordsRequest.builder()
            .shardIterator(shardIterator)
            .limit(Math.min(maxRecordsPerRequest, (int) (maxTotalRecords - allRecords.size())))
            .build());
  } catch (ExpiredIteratorException e) {
    log.warn("Shard iterator expired for {} during GetRecords, stopping read", range.getShardId());
    break;
  } catch (ProvisionedThroughputExceededException e) {
    throw new HoodieReadFromSourceException("Kinesis throughput exceeded reading shard " + range.getShardId(), e);
  }

  List<Record> records = response.records();
  // Update shardIterator before the empty check so its null-ness correctly reflects end-of-shard
  // even when the final response carries 0 records (closed shard fully exhausted).
  shardIterator = response.nextShardIterator();
  // CASE 1: No records returned: stop polling. nextShardIterator can be non-null when at LATEST with no new
  // data; continuing would cause an infinite loop of empty GetRecords calls.
  if (records.isEmpty()) {
    break;
  }
  // CASE 2: records returned.
  List<Record> toAdd = enableDeaggregation ? KinesisDeaggregator.deaggregate(records) : records;
  for (Record r : toAdd) {
    allRecords.add(r);
  }
  // Checkpoint uses the last Kinesis record's sequence number (from raw records, not deaggregated)
  lastSequenceNumber = records.get(records.size() - 1).sequenceNumber();

  requestCount++;
  // This is for rate limiting
  if (shardIterator != null && intervalMs > 0) {
    Thread.sleep(intervalMs);
  }
}
After deaggregation, allRecords can significantly exceed maxTotalRecords since one aggregated record can expand into many user records, but the while-loop only checks the limit before fetching. With aggressive KPL aggregation ratios (e.g., 100:1), a shard could return far more records than the configured per-shard limit. Have you considered truncating toAdd to maxTotalRecords - allRecords.size() before adding?
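A hedged sketch of the truncation suggested in this comment: after deaggregation expands a batch, cap it to the remaining per-shard budget so allRecords never exceeds maxTotalRecords. The helper name is illustrative, not the PR's code.

```java
import java.util.ArrayList;
import java.util.List;

public class DeaggregationCap {
  // Returns at most (maxTotalRecords - currentSize) elements of toAdd, copying only
  // when truncation is actually needed.
  public static <T> List<T> capToRemaining(List<T> toAdd, int currentSize, long maxTotalRecords) {
    long remaining = maxTotalRecords - currentSize;
    if (remaining <= 0) {
      return new ArrayList<>();
    }
    if (toAdd.size() <= remaining) {
      return toAdd;
    }
    return new ArrayList<>(toAdd.subList(0, (int) remaining));
  }
}
```

Note that dropping trailing user records this way interacts with checkpointing: the checkpoint would still advance to the last raw record's sequence number, so the truncated tail must be re-readable on the next run or the cap applied at raw-record granularity.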
// for test only
// LocalStack returns Long.MAX_VALUE for closed shards; use lastSeq as endSeq so we can detect
// "fully consumed" when the parent shard expires (lastSeq >= endSeq).
if (LOCALSTACK_END_SEQ_SENTINEL.equals(endSeq) && lastSeq != null && !lastSeq.isEmpty()) {
  endSeq = lastSeq;
}
This sentinel check is commented "for test only" but runs in the production code path. Could we remove this from production code and use a different way to handle it?
    .withDocumentation("Starting position when no checkpoint exists. TRIM_HORIZON (or EARLIEST), or LATEST. Default: LATEST.");

public static final ConfigProperty<Integer> KINESIS_GET_RECORDS_MAX_RECORDS = ConfigProperty
    .key(PREFIX + "getRecords.maxRecords")
Suggested change:
-    .key(PREFIX + "getRecords.maxRecords")
+    .key(PREFIX + "max.records.per.request")
    .defaultValue(10000)
    .withAlternatives(OLD_PREFIX + "getRecords.maxRecords")
    .markAdvanced()
    .withDocumentation("Maximum number of records to fetch per GetRecords API call. Kinesis limit is 10000.");
Is there validation on the config value?
What do you mean by validation? Do you mean testing whether this config is respected by the Kinesis server?
for (Record r : records) {
  v1Records.add(toV1Record(r));
}
List<UserRecord> userRecords = UserRecord.deaggregate(v1Records);
List<Record> result = new ArrayList<>(userRecords.size());
for (UserRecord ur : userRecords) {
  result.add(toV2Record(ur));
}
Why is the conversion between V1 and V2 needed? Does the AWS SDK provide an API to do this?
/**
 * Extract lastSeq from checkpoint value (which may be "lastSeq" or "lastSeq|endSeq").
 */
public static String getLastSeqFromValue(String value) {
  if (value == null || value.isEmpty()) {
    return value;
  }
  int sep = value.indexOf(END_SEQ_SEPARATOR);
  return sep >= 0 ? value.substring(0, sep) : value;
}

/**
 * Extract endSeq from checkpoint value if present. Returns null for open shards.
 */
public static String getEndSeqFromValue(String value) {
  if (value == null || value.isEmpty()) {
    return null;
  }
  int sep = value.indexOf(END_SEQ_SEPARATOR);
  return sep >= 0 && sep < value.length() - 1 ? value.substring(sep + 1) : null;
}
Merge these two together to return Pair<String, Option<String>> of last seq and optional end seq: getLastAndEndSeqFromCheckpoint
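A sketch of the suggested merge, parsing the value once; java.util.Optional and Map.Entry stand in for Hudi's Option and Pair, and the separator value is assumed:

```java
import java.util.AbstractMap;
import java.util.Map;
import java.util.Optional;

public class CheckpointValueParser {
  private static final char END_SEQ_SEPARATOR = '|'; // assumed to match the PR's separator

  // Returns (lastSeq, optional endSeq); the endSeq is empty for open shards,
  // mirroring the two separate getters above.
  public static Map.Entry<String, Optional<String>> getLastAndEndSeqFromCheckpoint(String value) {
    if (value == null || value.isEmpty()) {
      return new AbstractMap.SimpleImmutableEntry<>(value, Optional.empty());
    }
    int sep = value.indexOf(END_SEQ_SEPARATOR);
    if (sep < 0) {
      return new AbstractMap.SimpleImmutableEntry<>(value, Optional.empty()); // open shard
    }
    String lastSeq = value.substring(0, sep);
    String endSeq = sep < value.length() - 1 ? value.substring(sep + 1) : null;
    return new AbstractMap.SimpleImmutableEntry<>(lastSeq, Optional.ofNullable(endSeq));
  }
}
```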
/** LocalStack returns Long.MAX_VALUE for closed shards' endingSequenceNumber; real AWS returns actual value. */
public static final String LOCALSTACK_END_SEQ_SENTINEL = "9223372036854775807";
Avoid this test-only code in the production class.
// LocalStack sentinel: when lastSeq equals sentinel, we've fully consumed
if (LOCALSTACK_END_SEQ_SENTINEL.equals(endSeq) && LOCALSTACK_END_SEQ_SENTINEL.equals(lastSeq)) {
  return false;
public KinesisOffsetGen(TypedProperties props) {
  this.props = props;
  checkRequiredConfigProperties(props,
      Arrays.asList(KinesisSourceConfig.KINESIS_STREAM_NAME, KinesisSourceConfig.KINESIS_REGION));
    .markAdvanced()
    .withDocumentation("Fail when checkpoint references an expired shard instead of seeking to TRIM_HORIZON.");

public static final ConfigProperty<String> KINESIS_STARTING_POSITION = ConfigProperty
Suggested change:
- public static final ConfigProperty<String> KINESIS_STARTING_POSITION = ConfigProperty
+ public static final ConfigProperty<KinesisStartingPosition> KINESIS_STARTING_POSITION = ConfigProperty
String secretKey = getStringWithAltKeys(props, KinesisSourceConfig.KINESIS_SECRET_KEY, null);
if (accessKey != null && !accessKey.isEmpty() && secretKey != null && !secretKey.isEmpty()) {
  builder = builder.credentialsProvider(
      StaticCredentialsProvider.create(AwsBasicCredentials.create(accessKey, secretKey)));
Could HoodieConfigAWSCredentialsProvider be reused?
  throw new HoodieReadFromSourceException("Kinesis throughput exceeded listing shards for " + streamName, e);
} catch (LimitExceededException e) {
  throw new HoodieReadFromSourceException("Kinesis limit exceeded listing shards: " + e.getMessage(), e);
}
Should we also catch other exceptions? If not, where are those caught or thrown in the upper caller chain?
  throw new HoodieReadFromSourceException("Kinesis limit exceeded listing shards: " + e.getMessage(), e);
}
allShards.addAll(response.shards());
nextToken = response.nextToken();
Is this for paginated responses? Is there an API handling this for reuse, instead of hand-crafting the logic again?
    long sourceLimit,
    HoodieIngestionMetrics metrics) {
long maxEvents = getLongWithAltKeys(props, KinesisSourceConfig.MAX_EVENTS_FROM_KINESIS_SOURCE);
long numEvents = sourceLimit == Long.MAX_VALUE ? maxEvents : Math.min(sourceLimit, maxEvents);
It looks like this diverges from Kafka source's behavior of picking sourceLimit if it is not Long.MAX_VALUE.
  return streamName + "," + parts;
}

public static boolean checkStreamCheckpoint(Option<String> lastCheckpointStr) {
Suggested change:
- public static boolean checkStreamCheckpoint(Option<String> lastCheckpointStr) {
+ public static boolean isStreamCheckpointValid(Option<String> lastCheckpointStr) {
// CASE: last checkpoint exists.
if (lastCheckpointStr.isPresent() && CheckpointUtils.checkStreamCheckpoint(lastCheckpointStr)) {
  Map<String, String> checkpointOffsets = CheckpointUtils.strToOffsets(lastCheckpointStr.get());
  if (!checkpointOffsets.isEmpty() && lastCheckpointStr.get().startsWith(streamName + ",")) {
Add lastCheckpointStr.get().startsWith(streamName + ",") to CheckpointUtils#checkStreamCheckpoint?
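A sketch of folding the prefix check into the validity check, as suggested; java.util.Optional stands in for Hudi's Option, and the checkpoint layout ("streamName,…") follows the strToOffsets format quoted earlier:

```java
import java.util.Optional;

public class StreamCheckpointCheck {
  // Combined check: the checkpoint must be present, belong to this stream,
  // and carry shard offsets after the stream-name prefix.
  public static boolean isStreamCheckpointValid(Optional<String> lastCheckpointStr, String streamName) {
    if (!lastCheckpointStr.isPresent()) {
      return false;
    }
    String cp = lastCheckpointStr.get();
    return cp.startsWith(streamName + ",") && cp.length() > streamName.length() + 1;
  }
}
```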
if (endSeq != null) {
  // CASE 1: lastSeq >= endSeq: all records have been consumed.
  fullyConsumed = lastSeq != null && lastSeq.compareTo(endSeq) >= 0;
} else {
  // CASE 2: lastSeq < endSeq: some records haven't been consumed.
  // CASE 3: endSeq == null: open shard.
  fullyConsumed = false;
}
Could hasUnreadRecords be reused here?
boolean fullyConsumed;
if (endSeq != null) {
  // CASE 1: lastSeq >= endSeq: all records have been consumed.
  fullyConsumed = lastSeq != null && lastSeq.compareTo(endSeq) >= 0;
Why could lastSeq be larger than endSeq? I assume lastSeq should always be less than or equal to endSeq.
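A related caveat worth checking here: Kinesis sequence numbers are large decimal strings, and String.compareTo orders them lexicographically, which misorders values of different lengths ("9" sorts after "10"). A hedged sketch of a numeric comparison, assuming decimal-string sequence numbers (the method name is illustrative):

```java
import java.math.BigInteger;

public class SequenceCompare {
  // Compare sequence numbers numerically rather than lexicographically.
  public static boolean isFullyConsumed(String lastSeq, String endSeq) {
    if (endSeq == null) {
      return false; // open shard: never fully consumed
    }
    return lastSeq != null
        && new BigInteger(lastSeq).compareTo(new BigInteger(endSeq)) >= 0;
  }
}
```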
for (String shardId : availableShardIds) {
  if (checkpointOffsets.containsKey(shardId)) {
    String lastSeq = CheckpointUtils.getLastSeqFromValue(checkpointOffsets.get(shardId));
    if (lastSeq != null && !lastSeq.isEmpty()) {
lastSeq should not be empty; otherwise an error should be thrown. It would be good to add ValidationUtils#checkArgument.
int targetParallelism = minPartitions > 0
    ? (int) Math.max(minPartitions, ranges.size())
    : ranges.size();
metrics.updateStreamerSourceParallelism(targetParallelism);
The minPartitions or targetParallelism is not used to determine the parallelism or the ranges like the Kafka source does. Do we want to consider that in case the number of shards is low?
public KinesisShardRange[] getNextShardRanges(Option<Checkpoint> lastCheckpoint,
    long sourceLimit,
    HoodieIngestionMetrics metrics) {
We should also integrate sourceProfileSupplier (SourceProfileSupplier) or make sure to track that as a follow-up.
After learning more about Kinesis, it looks like there is rate limiting per shard (https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html#kds-api-limits):
- GetRecords: 5 transactions per second, The maximum number of records that can be returned per call is 10,000. The maximum size of data that GetRecords can return is 10 MB. If a call returns this amount of data, subsequent calls made within the next 5 seconds throw ProvisionedThroughputExceededException. If there is insufficient provisioned throughput on the stream, subsequent calls made within the next 1 second throw ProvisionedThroughputExceededException.
Based on this, further splitting a shard into reading from multiple executors may not be helpful, which is different from Kafka. I think it would be good to document these aspects.
@AllArgsConstructor
@Getter
public static class ShardReadResult implements java.io.Serializable {
  private final List<Record> records;
We should avoid storing the read records in a list which can significantly increase the memory usage. We should use the iterator pattern to reduce the memory pressure on the executor side.
/**
 * Read records from a single shard.
 * @param enableDeaggregation when true, de-aggregates KPL records into individual user records
 */
public static ShardReadResult readShardRecords(KinesisClient client, String streamName,
    KinesisShardRange range, KinesisSourceConfig.KinesisStartingPosition defaultPosition,
    int maxRecordsPerRequest, long intervalMs, long maxTotalRecords,
    boolean enableDeaggregation) throws InterruptedException {
  String shardIterator;
  try {
    shardIterator = getShardIterator(client, streamName, range, defaultPosition);
  } catch (InvalidArgumentException e) {
    // GetShardIterator throws InvalidArgumentException (not ExpiredIteratorException) when the
    // requested sequence number is past the stream's retention window.
    throw new HoodieReadFromSourceException("Sequence number in checkpoint is expired or invalid for shard "
        + range.getShardId() + ". Reset the checkpoint to recover.", e);
  } catch (ResourceNotFoundException e) {
    throw new HoodieReadFromSourceException("Shard or stream not found: " + range.getShardId(), e);
  } catch (ProvisionedThroughputExceededException e) {
    throw new HoodieReadFromSourceException("Kinesis throughput exceeded reading shard " + range.getShardId(), e);
  }
  List<Record> allRecords = new ArrayList<>();
  String lastSequenceNumber = null;
  int requestCount = 0;

  while (allRecords.size() < maxTotalRecords && shardIterator != null) {
    GetRecordsResponse response;
    try {
      response = client.getRecords(
          GetRecordsRequest.builder()
              .shardIterator(shardIterator)
              .limit(Math.min(maxRecordsPerRequest, (int) (maxTotalRecords - allRecords.size())))
              .build());
    } catch (ExpiredIteratorException e) {
      log.warn("Shard iterator expired for {} during GetRecords, stopping read", range.getShardId());
      break;
    } catch (ProvisionedThroughputExceededException e) {
      throw new HoodieReadFromSourceException("Kinesis throughput exceeded reading shard " + range.getShardId(), e);
    }

    List<Record> records = response.records();
    // Update shardIterator before the empty check so its null-ness correctly reflects end-of-shard
    // even when the final response carries 0 records (closed shard fully exhausted).
    shardIterator = response.nextShardIterator();
    // CASE 1: No records returned: stop polling. nextShardIterator can be non-null when at LATEST with no new
    // data; continuing would cause an infinite loop of empty GetRecords calls.
    if (records.isEmpty()) {
      break;
    }
    // CASE 2: records returned.
    List<Record> toAdd = enableDeaggregation ? KinesisDeaggregator.deaggregate(records) : records;
    for (Record r : toAdd) {
      allRecords.add(r);
    }
    // Checkpoint uses the last Kinesis record's sequence number (from raw records, not deaggregated)
    lastSequenceNumber = records.get(records.size() - 1).sequenceNumber();

    requestCount++;
    // This is for rate limiting
    if (shardIterator != null && intervalMs > 0) {
      Thread.sleep(intervalMs);
    }
  }
Could we wrap the logic here into a closable iterator that can be directly used by the executor in JsonKinesisSource#toBatch without accumulating records in memory and then returning the iterator?
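A hedged sketch of the closable-iterator pattern suggested here. The page supplier stands in for the GetRecords polling loop; an empty page signals the shard is exhausted, mirroring the empty-response break in readShardRecords. Names are illustrative, not the PR's code.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Supplier;

public class PagedShardIterator<T> implements Iterator<T>, AutoCloseable {
  private final Supplier<List<T>> pageSupplier;
  private final Deque<T> buffer = new ArrayDeque<>();
  private boolean exhausted = false;

  public PagedShardIterator(Supplier<List<T>> pageSupplier) {
    this.pageSupplier = pageSupplier;
  }

  @Override
  public boolean hasNext() {
    while (buffer.isEmpty() && !exhausted) {
      List<T> page = pageSupplier.get(); // one GetRecords call per page in the real source
      if (page.isEmpty()) {
        exhausted = true;
      } else {
        buffer.addAll(page);
      }
    }
    return !buffer.isEmpty();
  }

  @Override
  public T next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    return buffer.poll();
  }

  @Override
  public void close() {
    // the real implementation would close the KinesisClient here
  }
}
```

Only one page is buffered at a time, so executor memory stays bounded by the GetRecords page size rather than the whole shard read.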
// This is for rate limiting
if (shardIterator != null && intervalMs > 0) {
  Thread.sleep(intervalMs);
}
Could we remove this self rate-limiting? Instead, we should let the SDK do the retries after rate limiting.
builder.shardIteratorType(defaultPosition == KinesisSourceConfig.KinesisStartingPosition.TRIM_HORIZON
    ? ShardIteratorType.TRIM_HORIZON : ShardIteratorType.LATEST);
KinesisSourceConfig.KinesisStartingPosition.EARLIEST should also map to ShardIteratorType.TRIM_HORIZON.
  return new ShardReadResult(allRecords, Option.ofNullable(lastSequenceNumber), shardIterator == null);
}

private static String getShardIterator(KinesisClient client, String streamName,
Suggested change:
- private static String getShardIterator(KinesisClient client, String streamName,
+ private static String getCurrentCursor(KinesisClient client, String streamName,
    KinesisShardRange range, KinesisSourceConfig.KinesisStartingPosition defaultPosition,
    int maxRecordsPerRequest, long intervalMs, long maxTotalRecords,
    boolean enableDeaggregation) throws InterruptedException {
  String shardIterator;
Suggested change:
- String shardIterator;
+ String currentCursor;
  allRecords.add(r);
}
// Checkpoint uses the last Kinesis record's sequence number (from raw records, not deaggregated)
lastSequenceNumber = records.get(records.size() - 1).sequenceNumber();
This assumes that the returned records are sorted based on the sequence number. Is that guaranteed?
// Filter out shards with no unread records to avoid unnecessary GetRecords calls
boolean useLatestWhenNoCheckpoint =
    offsetGen.getStartingPosition() == KinesisSourceConfig.KinesisStartingPosition.LATEST;
Rename this to startingStrategy or something similar for readability?
boolean useLatestWhenNoCheckpoint =
    offsetGen.getStartingPosition() == KinesisSourceConfig.KinesisStartingPosition.LATEST;
KinesisOffsetGen.KinesisShardRange[] allShardRanges = shardRanges;
int beforeFilter = shardRanges.length;
Suggested change:
- int beforeFilter = shardRanges.length;
+ int lengthBeforeFilter = shardRanges.length;
String checkpointStr = lastCheckpoint.isPresent() ? lastCheckpoint.get().getCheckpointKey() : "";
return new InputBatch<>(Option.empty(), checkpointStr);
Could there be a case where the checkpoint needs to be set, instead of empty string, and there is no message to ingest?
KinesisOffsetGen.KinesisShardRange[] shardRanges = offsetGen.getNextShardRanges(
    lastCheckpoint, sourceLimit, metrics);
For all the methods handling shard ranges, could you use List<KinesisOffsetGen.KinesisShardRange> instead of array so it's easier to read?
/**
 * Create checkpoint string from the batch and shard ranges.
 * Subclasses provide checkpoint data (shardId -> sequenceNumber) collected during the read.
 * Must include both read shards (from shardRangesRead) and filtered shards (from allShardRanges)
 * so the next run does not re-read filtered-out shards from TRIM_HORIZON.
 */
protected abstract String createCheckpointFromBatch(T batch,
    KinesisOffsetGen.KinesisShardRange[] shardRangesRead,
    KinesisOffsetGen.KinesisShardRange[] allShardRanges);
Could shardRangesRead and allShardRanges be renamed to be easily understood?
KinesisOffsetGen.KinesisShardRange[] shardRanges = offsetGen.getNextShardRanges(
    lastCheckpoint, sourceLimit, metrics);
To clarify, offsetGen.getNextShardRanges returns all the open and closed ranges (excluding expired ranges) from the checkpoint and the current stream state, correct?
// Handle expired shards that exist in the last checkpoint.
if (!expiredShardIds.isEmpty()) {
  boolean failOnDataLoss = getBooleanWithAltKeys(props, KinesisSourceConfig.ENABLE_FAIL_ON_DATA_LOSS);
  for (String shardId : expiredShardIds) {
There should also be validation on open and closed shards. If the last sequence number of a shard in the last checkpoint is before the start of the shard based on the current state, e.g., due to data retention, there is also data loss.
<!-- AWS Kinesis SDK for JsonKinesisSource -->
<include>software.amazon.awssdk:kinesis</include>
Should we add this to hudi-aws-bundle so that it is not added here? I don't see any other AWS artifacts included.
private static class ShardFetchResult implements Serializable {
  private final List<String> records;
  private final String shardId;
  private final Option<String> lastSequenceNumber;
  private final boolean reachedEndOfShard;
}
This class is similar to KinesisOffsetGen.ShardReadResult. Could we keep one of them only?
long totalMsgs = getRecordCount(batch);
metrics.updateStreamerSourceNewMessageCount(METRIC_NAME_KINESIS_MESSAGE_IN_COUNT, totalMsgs);

log.info("Read {} records from Kinesis stream {} with {} shards, checkpoint: {}",
    totalMsgs, offsetGen.getStreamName(), shardRanges.length, checkpointStr);
This triggers eager evaluation and record reading. Could we avoid that?
private final boolean shouldAddOffsets;
private final boolean enableDeaggregation;
private final int maxRecordsPerRequest;
private final long intervalMs;

JavaRDD<ShardFetchResult> fetchRdd = sparkContext.parallelize(
    java.util.Arrays.asList(shardRanges), shardRanges.length)
    .mapPartitions(shardRangeIt -> {
Any reason using mapPartitions instead of map?
Generally mapPartitions is more efficient than map since some resources can be reused, like the client, if a partition is assigned multiple shards. But here there is probably no big difference since we give 1 shard per partition.
String json = recordToJsonStatic(r, range.getShardId(), readConfig.isShouldAddOffsets());
if (json != null) {
  recordStrings.add(json);
}
Should it throw a runtime exception if json is null?
    readConfig.getMaxRecordsPerRequest(), readConfig.getIntervalMs(), readConfig.getMaxRecordsPerShard(),
    readConfig.isEnableDeaggregation());

List<String> recordStrings = new ArrayList<>();
Similarly, could we avoid accumulating all JSON strings in a list, and construct an iterator instead?
private static KinesisClient createKinesisClientFromConfig(KinesisReadConfig config) {
  software.amazon.awssdk.services.kinesis.KinesisClientBuilder builder =
      KinesisClient.builder().region(software.amazon.awssdk.regions.Region.of(config.getRegion()));
  if (config.getEndpointUrl() != null && !config.getEndpointUrl().isEmpty()) {
    builder = builder.endpointOverride(java.net.URI.create(config.getEndpointUrl()));
  }
  if (config.getAccessKey() != null && !config.getAccessKey().isEmpty()
      && config.getSecretKey() != null && !config.getSecretKey().isEmpty()) {
    builder = builder.credentialsProvider(
        software.amazon.awssdk.auth.credentials.StaticCredentialsProvider.create(
            software.amazon.awssdk.auth.credentials.AwsBasicCredentials.create(
                config.getAccessKey(), config.getSecretKey())));
  }
  return builder.build();
}
KinesisOffsetGen has createKinesisClient with similar logic. Let's consolidate these two into one.
if (dataStr == null || dataStr.trim().isEmpty()) {
  return null;
}
if (shouldAddOffsets) {
nit: rename the config and variable to be aligned with Kinesis
  return dataStr;
}

private Map<String, String> buildCheckpointFromSummaries(List<ShardFetchSummary> summaries) {
Does this contain all shards including the ones that are filtered out (i.e., shards without new data or read in this batch)?
}

@Override
protected String createCheckpointFromBatch(JavaRDD<String> batch,
The checkpoint calculation logic is spread in multiple methods (toBatch, buildCheckpointFromSummaries, createCheckpointFromBatch, etc.). Could that be consolidated into one place?
Describe the issue this Pull Request addresses
#18228
This PR adds AWS Kinesis Data Streams as a source for the Hudi DeltaStreamer, so users can ingest JSON records from Kinesis Data Streams into Hudi tables.
Previously, DeltaStreamer supported Kafka, JDBC, DFS (Parquet, CSV, ORC), and SQL sources, but not Kinesis.
Summary and Changelog
Adds JsonKinesisSource – reads JSON from AWS Kinesis Data Streams
Adds KinesisSource – base abstraction for Kinesis sources
Adds KinesisOffsetGen – handles shard iteration, checkpointing, and resumable reads
Introduces KinesisSourceConfig and KinesisReadConfig for configuration
Adds KinesisTestUtils for LocalStack-based tests
Integrates with existing DeltaStreamer and streamer metrics
Impact
DeltaStreamer users can use Kinesis as a source alongside Kafka and others
New ingestion path: Kinesis → DeltaStreamer → Hudi
Adds an optional AWS Kinesis dependency; non-Kinesis use cases are unaffected
Tests run against LocalStack, so no live AWS credentials are required
Risk Level
Low
Documentation Update
Contributor's checklist