
feat: Create JsonKinesisSource #18224

Open
linliu-code wants to merge 13 commits into apache:master from linliu-code:create_kinesis_source

Conversation

@linliu-code
Collaborator

@linliu-code linliu-code commented Feb 19, 2026

Describe the issue this Pull Request addresses

#18228

This PR adds AWS Kinesis Data Streams as a source for the Hudi DeltaStreamer, so users can ingest JSON records from Kinesis Data Streams into Hudi tables.
Previously, DeltaStreamer supported Kafka, JDBC, DFS (Parquet, CSV, ORC), and SQL sources, but not Kinesis.

Summary and Changelog

Adds JsonKinesisSource – reads JSON from AWS Kinesis Data Streams
Adds KinesisSource – base abstraction for Kinesis sources
Adds KinesisOffsetGen – handles shard iteration, checkpointing, and resumable reads
Introduces KinesisSourceConfig and KinesisReadConfig for configuration
Adds KinesisTestUtils for LocalStack-based tests
Integrates with existing DeltaStreamer and streamer metrics

Impact

DeltaStreamer users can use Kinesis as a source alongside Kafka and others
New ingestion path: Kinesis → DeltaStreamer → Hudi
Adds an optional AWS Kinesis dependency; non-Kinesis use cases are unaffected
Tests run against LocalStack, so no live AWS credentials are required

Risk Level

Low

Documentation Update

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Feb 19, 2026
@linliu-code linliu-code force-pushed the create_kinesis_source branch from 9ed3283 to fec6001 Compare February 19, 2026 23:01
@github-actions github-actions bot added size:XL PR with lines of changes > 1000 and removed size:L PR with lines of changes in (300, 1000] labels Feb 19, 2026
@linliu-code linliu-code force-pushed the create_kinesis_source branch from fec6001 to d09affd Compare February 20, 2026 00:11
@linliu-code linliu-code marked this pull request as ready for review February 20, 2026 01:30
@linliu-code linliu-code changed the title from feat:Create KinesisSource to feat: Create JsonKinesisSource Feb 20, 2026
1. Support aggregated records.
2. Avoid expired shards blocking the stream.
@linliu-code linliu-code force-pushed the create_kinesis_source branch from d09affd to a517e26 Compare February 20, 2026 21:27
assertTrue(checkpointAfterMerge.startsWith(streamName + ","));
int initialShardCount = KinesisOffsetGen.CheckpointUtils.strToOffsets(checkpointAfterBatch1).size();
int shardCountAfterMerge = KinesisOffsetGen.CheckpointUtils.strToOffsets(checkpointAfterMerge).size();
assertTrue(shardCountAfterMerge > initialShardCount,
Collaborator Author

After merge, new child shards are generated, but the parent shards are still there and not expired yet. So ">" not "<"

@linliu-code linliu-code force-pushed the create_kinesis_source branch from 01a34fe to d43f8eb Compare February 21, 2026 10:01
@codecov-commenter

codecov-commenter commented Feb 21, 2026

Codecov Report

❌ Patch coverage is 0% with 455 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.03%. Comparing base (ec04479) to head (a28e9e7).
⚠️ Report is 988 commits behind head on master.

Files with missing lines Patch % Lines
...di/utilities/sources/helpers/KinesisOffsetGen.java 0.00% 201 Missing ⚠️
...ache/hudi/utilities/sources/JsonKinesisSource.java 0.00% 118 Missing ⚠️
...che/hudi/utilities/config/KinesisSourceConfig.java 0.00% 81 Missing ⚠️
...g/apache/hudi/utilities/sources/KinesisSource.java 0.00% 29 Missing ⚠️
...utilities/sources/helpers/KinesisDeaggregator.java 0.00% 26 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18224      +/-   ##
============================================
- Coverage     61.43%   57.03%   -4.41%     
+ Complexity    23082    18520    -4562     
============================================
  Files          2108     1949     -159     
  Lines        127636   106587   -21049     
  Branches      14534    13196    -1338     
============================================
- Hits          78409    60787   -17622     
+ Misses        42873    40075    -2798     
+ Partials       6354     5725     -629     
Flag Coverage Δ
hadoop-mr-java-client 45.41% <ø> (?)
spark-java-tests 47.22% <0.00%> (?)
spark-scala-tests 45.30% <0.00%> (?)

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
...utilities/sources/helpers/KinesisDeaggregator.java 0.00% <0.00%> (ø)
...g/apache/hudi/utilities/sources/KinesisSource.java 0.00% <0.00%> (ø)
...che/hudi/utilities/config/KinesisSourceConfig.java 0.00% <0.00%> (ø)
...ache/hudi/utilities/sources/JsonKinesisSource.java 0.00% <0.00%> (ø)
...di/utilities/sources/helpers/KinesisOffsetGen.java 0.00% <0.00%> (ø)

... and 4009 files with indirect coverage changes


@hudi-bot
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

record.approximateArrivalTimestamp().toEpochMilli());
}
return OBJECT_MAPPER.writeValueAsString(node);
} catch (Exception e) {
Contributor

This catch (Exception e) silently drops the error and returns the raw string without any logging. If offset appending fails (e.g., data isn't valid JSON, or it's a JSON array rather than an object), you'd have no way to tell why some records have offsets and others don't. Could you at least log a warning here so data quality issues are debuggable?
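A minimal sketch of the logged fallback being asked for, using java.util.logging as a stand-in for the project's logger. The method name, the string-based JSON handling, and the `_shard_id`/`_arrival_ts` field names are placeholders, not the PR's actual code:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class OffsetAppendSketch {
  private static final Logger LOG = Logger.getLogger(OffsetAppendSketch.class.getName());

  /**
   * On any failure to append offset fields (non-object JSON, parse error, ...),
   * log a warning that names the shard before returning the raw payload, so
   * records missing offsets can be traced back to a cause.
   */
  public static String appendOffsetsOrRaw(String rawJson, String shardId, long arrivalMillis) {
    try {
      // Stand-in for the real Jackson-based mutation; only top-level JSON
      // objects are accepted, so arrays and scalars fall through to the catch.
      if (rawJson == null || !rawJson.trim().startsWith("{")) {
        throw new IllegalArgumentException("not a JSON object");
      }
      String body = rawJson.trim();
      String suffix = ",\"_shard_id\":\"" + shardId + "\",\"_arrival_ts\":" + arrivalMillis + "}";
      return body.substring(0, body.length() - 1) + suffix;
    } catch (Exception e) {
      LOG.log(Level.WARNING,
          "Failed to append Kinesis offsets for shard " + shardId + "; returning raw record", e);
      return rawJson;
    }
  }
}
```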

});

// Cache so we can both get records and checkpoint from the same RDD
fetchRdd.persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK());
Contributor

If toBatch is called a second time (e.g., during retry), the previous persistedFetchRdd is overwritten without being unpersisted first, leaking Spark storage memory. Could you add if (persistedFetchRdd != null) { persistedFetchRdd.unpersist(); } before the persist call, or is it handled somewhere else?

Contributor

Could we avoid persisting the RDDs? This can degrade the performance if spilling happens.

Comment on lines +436 to +473
while (allRecords.size() < maxTotalRecords && shardIterator != null) {
GetRecordsResponse response;
try {
response = client.getRecords(
GetRecordsRequest.builder()
.shardIterator(shardIterator)
.limit(Math.min(maxRecordsPerRequest, (int) (maxTotalRecords - allRecords.size())))
.build());
} catch (ExpiredIteratorException e) {
log.warn("Shard iterator expired for {} during GetRecords, stopping read", range.getShardId());
break;
} catch (ProvisionedThroughputExceededException e) {
throw new HoodieReadFromSourceException("Kinesis throughput exceeded reading shard " + range.getShardId(), e);
}

List<Record> records = response.records();
// Update shardIterator before the empty check so its null-ness correctly reflects end-of-shard
// even when the final response carries 0 records (closed shard fully exhausted).
shardIterator = response.nextShardIterator();
// CASE 1: No records returned: stop polling. nextShardIterator can be non-null when at LATEST with no new
// data; continuing would cause an infinite loop of empty GetRecords calls.
if (records.isEmpty()) {
break;
}
// CASE 2: records returned.
List<Record> toAdd = enableDeaggregation ? KinesisDeaggregator.deaggregate(records) : records;
for (Record r : toAdd) {
allRecords.add(r);
}
// Checkpoint uses the last Kinesis record's sequence number (from raw records, not deaggregated)
lastSequenceNumber = records.get(records.size() - 1).sequenceNumber();

requestCount++;
// This is for rate limiting
if (shardIterator != null && intervalMs > 0) {
Thread.sleep(intervalMs);
}
}
Contributor

After deaggregation, allRecords can significantly exceed maxTotalRecords since one aggregated record can expand into many user records, but the while-loop only checks the limit before fetching. With aggressive KPL aggregation ratios (e.g., 100:1), a shard could return far more records than the configured per-shard limit. Have you considered truncating toAdd to maxTotalRecords - allRecords.size() before adding?
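The truncation being suggested could look roughly like this; the helper name and generic signature are illustrative, not taken from the PR:

```java
import java.util.ArrayList;
import java.util.List;

public class DeaggregationCapSketch {
  /**
   * Cap a batch of (possibly deaggregated) records so the accumulated total
   * never exceeds maxTotalRecords, even when one aggregated record expands
   * into many user records.
   */
  public static <T> List<T> capToRemaining(List<T> toAdd, int currentSize, long maxTotalRecords) {
    long remaining = maxTotalRecords - currentSize;
    if (remaining <= 0) {
      return new ArrayList<>();
    }
    if (toAdd.size() <= remaining) {
      return toAdd;
    }
    // Copy the sublist so the result does not alias the deaggregated list.
    return new ArrayList<>(toAdd.subList(0, (int) remaining));
  }
}
```

Note the checkpoint side effect: if the tail of a deaggregated batch is dropped, the last raw sequence number may point past records that were never emitted, so the checkpoint logic would need to account for that.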

Comment on lines +248 to +253
// for test only
// LocalStack returns Long.MAX_VALUE for closed shards; use lastSeq as endSeq so we can detect
// "fully consumed" when the parent shard expires (lastSeq >= endSeq).
if (LOCALSTACK_END_SEQ_SENTINEL.equals(endSeq) && lastSeq != null && !lastSeq.isEmpty()) {
endSeq = lastSeq;
}
Contributor

This sentinel check is commented "for test only" but runs in the production code path. Could we remove this from production code and use a different way to handle it?

.withDocumentation("Starting position when no checkpoint exists. TRIM_HORIZON (or EARLIEST), or LATEST. Default: LATEST.");

public static final ConfigProperty<Integer> KINESIS_GET_RECORDS_MAX_RECORDS = ConfigProperty
.key(PREFIX + "getRecords.maxRecords")
Contributor

Suggested change
.key(PREFIX + "getRecords.maxRecords")
.key(PREFIX + "max.records.per.request")

.defaultValue(10000)
.withAlternatives(OLD_PREFIX + "getRecords.maxRecords")
.markAdvanced()
.withDocumentation("Maximum number of records to fetch per GetRecords API call. Kinesis limit is 10000.");
Contributor

Is there a validation on the config?

Collaborator Author

What do you mean by validation? Do you mean testing whether this config is actually respected by the Kinesis server?

Comment on lines +48 to +55
for (Record r : records) {
v1Records.add(toV1Record(r));
}
List<UserRecord> userRecords = UserRecord.deaggregate(v1Records);
List<Record> result = new ArrayList<>(userRecords.size());
for (UserRecord ur : userRecords) {
result.add(toV2Record(ur));
}
Contributor

Why is the conversion between V1 and V2 needed? Does the AWS SDK provide an API to do this?

Comment on lines +101 to +121
/**
* Extract lastSeq from checkpoint value (which may be "lastSeq" or "lastSeq|endSeq").
*/
public static String getLastSeqFromValue(String value) {
if (value == null || value.isEmpty()) {
return value;
}
int sep = value.indexOf(END_SEQ_SEPARATOR);
return sep >= 0 ? value.substring(0, sep) : value;
}

/**
* Extract endSeq from checkpoint value if present. Returns null for open shards.
*/
public static String getEndSeqFromValue(String value) {
if (value == null || value.isEmpty()) {
return null;
}
int sep = value.indexOf(END_SEQ_SEPARATOR);
return sep >= 0 && sep < value.length() - 1 ? value.substring(sep + 1) : null;
}
Contributor

Merge these two together to return Pair<String, Option<String>> of last seq and optional end seq: getLastAndEndSeqFromCheckpoint
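A sketch of the merged accessor, assuming the `|` separator and the "lastSeq" / "lastSeq|endSeq" format shown above; a small holder class stands in for Hudi's Pair/Option:

```java
public class CheckpointValueSketch {
  // Assumed separator; the real constant is END_SEQ_SEPARATOR in the PR.
  private static final char END_SEQ_SEPARATOR = '|';

  /** Minimal stand-in for Pair<String, Option<String>>. */
  public static final class LastAndEndSeq {
    public final String lastSeq;
    public final String endSeq; // null when absent (open shard)
    LastAndEndSeq(String lastSeq, String endSeq) {
      this.lastSeq = lastSeq;
      this.endSeq = endSeq;
    }
  }

  /**
   * One-pass replacement for getLastSeqFromValue + getEndSeqFromValue:
   * parse "lastSeq" or "lastSeq|endSeq" in a single call.
   */
  public static LastAndEndSeq getLastAndEndSeqFromCheckpoint(String value) {
    if (value == null || value.isEmpty()) {
      return new LastAndEndSeq(value, null);
    }
    int sep = value.indexOf(END_SEQ_SEPARATOR);
    if (sep < 0) {
      return new LastAndEndSeq(value, null);
    }
    String end = sep < value.length() - 1 ? value.substring(sep + 1) : null;
    return new LastAndEndSeq(value.substring(0, sep), end);
  }
}
```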


/** LocalStack returns Long.MAX_VALUE for closed shards' endingSequenceNumber; real AWS returns actual value. */
public static final String LOCALSTACK_END_SEQ_SENTINEL = "9223372036854775807";

Contributor

Avoid this test-only code in the production class.

Comment on lines +215 to +217
// LocalStack sentinel: when lastSeq equals sentinel, we've fully consumed
if (LOCALSTACK_END_SEQ_SENTINEL.equals(endSeq) && LOCALSTACK_END_SEQ_SENTINEL.equals(lastSeq)) {
return false;
Contributor

Remove this test-only code.

public KinesisOffsetGen(TypedProperties props) {
this.props = props;
checkRequiredConfigProperties(props,
Arrays.asList(KinesisSourceConfig.KINESIS_STREAM_NAME, KinesisSourceConfig.KINESIS_REGION));
Contributor

Is KINESIS_REGION required?

.markAdvanced()
.withDocumentation("Fail when checkpoint references an expired shard instead of seeking to TRIM_HORIZON.");

public static final ConfigProperty<String> KINESIS_STARTING_POSITION = ConfigProperty
Contributor

Suggested change
public static final ConfigProperty<String> KINESIS_STARTING_POSITION = ConfigProperty
public static final ConfigProperty<KinesisStartingPosition> KINESIS_STARTING_POSITION = ConfigProperty

String secretKey = getStringWithAltKeys(props, KinesisSourceConfig.KINESIS_SECRET_KEY, null);
if (accessKey != null && !accessKey.isEmpty() && secretKey != null && !secretKey.isEmpty()) {
builder = builder.credentialsProvider(
StaticCredentialsProvider.create(AwsBasicCredentials.create(accessKey, secretKey)));
Contributor

Could HoodieConfigAWSCredentialsProvider be reused?

throw new HoodieReadFromSourceException("Kinesis throughput exceeded listing shards for " + streamName, e);
} catch (LimitExceededException e) {
throw new HoodieReadFromSourceException("Kinesis limit exceeded listing shards: " + e.getMessage(), e);
}
Contributor

Should we also catch other exceptions? If not, where are those caught or thrown in the upper caller chain?

throw new HoodieReadFromSourceException("Kinesis limit exceeded listing shards: " + e.getMessage(), e);
}
allShards.addAll(response.shards());
nextToken = response.nextToken();
Contributor

Is this for paginated responses? Is there an API handling this for reuse, instead of hand-crafting the logic again?

long sourceLimit,
HoodieIngestionMetrics metrics) {
long maxEvents = getLongWithAltKeys(props, KinesisSourceConfig.MAX_EVENTS_FROM_KINESIS_SOURCE);
long numEvents = sourceLimit == Long.MAX_VALUE ? maxEvents : Math.min(sourceLimit, maxEvents);
Contributor

It looks like this diverges from Kafka source's behavior of picking sourceLimit if it is not Long.MAX_VALUE.

return streamName + "," + parts;
}

public static boolean checkStreamCheckpoint(Option<String> lastCheckpointStr) {
Contributor

Suggested change
public static boolean checkStreamCheckpoint(Option<String> lastCheckpointStr) {
public static boolean isStreamCheckpointValid(Option<String> lastCheckpointStr) {

// CASE: last checkpoint exists.
if (lastCheckpointStr.isPresent() && CheckpointUtils.checkStreamCheckpoint(lastCheckpointStr)) {
Map<String, String> checkpointOffsets = CheckpointUtils.strToOffsets(lastCheckpointStr.get());
if (!checkpointOffsets.isEmpty() && lastCheckpointStr.get().startsWith(streamName + ",")) {
Contributor

Add lastCheckpointStr.get().startsWith(streamName + ",") to CheckpointUtils#checkStreamCheckpoint?
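Folding the prefix check into the validity test might look like this; the real method takes Option<String>, so a nullable String stands in here, and the method name follows the rename suggested earlier in the review:

```java
public class StreamCheckpointSketch {
  /**
   * A checkpoint is usable only if it is non-empty and was written for this
   * stream, i.e. it starts with "streamName,".
   */
  public static boolean isStreamCheckpointValid(String lastCheckpointStr, String streamName) {
    return lastCheckpointStr != null
        && !lastCheckpointStr.isEmpty()
        && lastCheckpointStr.startsWith(streamName + ",");
  }
}
```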

Comment on lines +332 to +339
if (endSeq != null) {
// CASE 1: lastSeq >= endSeq: all records have been consumed.
fullyConsumed = lastSeq != null && lastSeq.compareTo(endSeq) >= 0;
} else {
// CASE 2: lastSeq < endSeq: some records haven't been consumed.
// CASE 3: endSeq == null: open shard.
fullyConsumed = false;
}
Contributor

Could hasUnreadRecords be reused here?

boolean fullyConsumed;
if (endSeq != null) {
// CASE 1: lastSeq >= endSeq: all records have been consumed.
fullyConsumed = lastSeq != null && lastSeq.compareTo(endSeq) >= 0;
Contributor

Why could lastSeq be larger than endSeq? I assume lastSeq should always be less than or equal to endSeq.

for (String shardId : availableShardIds) {
if (checkpointOffsets.containsKey(shardId)) {
String lastSeq = CheckpointUtils.getLastSeqFromValue(checkpointOffsets.get(shardId));
if (lastSeq != null && !lastSeq.isEmpty()) {
Contributor

lastSeq should not be empty, otherwise it should throw an error. It would be good to add ValidationUtils#checkArgument.
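A sketch of that guard; a local checkArgument stands in for ValidationUtils#checkArgument, and the method name is illustrative:

```java
public class CheckpointGuardSketch {
  /** Minimal stand-in for Hudi's ValidationUtils#checkArgument. */
  static void checkArgument(boolean condition, String message) {
    if (!condition) {
      throw new IllegalArgumentException(message);
    }
  }

  /**
   * A checkpointed shard must carry a non-empty lastSeq; failing loudly
   * beats silently skipping the shard.
   */
  public static String requireLastSeq(String shardId, String lastSeq) {
    checkArgument(lastSeq != null && !lastSeq.isEmpty(),
        "Empty lastSeq in checkpoint for shard " + shardId);
    return lastSeq;
  }
}
```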

Comment on lines +380 to +383
int targetParallelism = minPartitions > 0
? (int) Math.max(minPartitions, ranges.size())
: ranges.size();
metrics.updateStreamerSourceParallelism(targetParallelism);
Contributor

The minPartitions or targetParallelism is not used to determine the parallelism or the ranges like Kafka source. Do we want to consider that in case the number of shards is low?

Comment on lines +296 to +298
public KinesisShardRange[] getNextShardRanges(Option<Checkpoint> lastCheckpoint,
long sourceLimit,
HoodieIngestionMetrics metrics) {
Contributor

We should also integrate sourceProfileSupplier (SourceProfileSupplier) or make sure to track that as a follow-up.

Contributor

After learning more about Kinesis, it looks like there is rate limiting per shard (https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html#kds-api-limits):

  • GetRecords: 5 transactions per second, The maximum number of records that can be returned per call is 10,000. The maximum size of data that GetRecords can return is 10 MB. If a call returns this amount of data, subsequent calls made within the next 5 seconds throw ProvisionedThroughputExceededException. If there is insufficient provisioned throughput on the stream, subsequent calls made within the next 1 second throw ProvisionedThroughputExceededException.

Based on this, further splitting a shard into reading from multiple executors may not be helpful, which is different from Kafka. I think it would be good to document these aspects.

@AllArgsConstructor
@Getter
public static class ShardReadResult implements java.io.Serializable {
private final List<Record> records;
Contributor

We should avoid storing the read records in a list which can significantly increase the memory usage. We should use the iterator pattern to reduce the memory pressure on the executor side.

Comment on lines +411 to +473
/**
* Read records from a single shard.
* @param enableDeaggregation when true, de-aggregates KPL records into individual user records
*/
public static ShardReadResult readShardRecords(KinesisClient client, String streamName,
KinesisShardRange range, KinesisSourceConfig.KinesisStartingPosition defaultPosition,
int maxRecordsPerRequest, long intervalMs, long maxTotalRecords,
boolean enableDeaggregation) throws InterruptedException {
String shardIterator;
try {
shardIterator = getShardIterator(client, streamName, range, defaultPosition);
} catch (InvalidArgumentException e) {
// GetShardIterator throws InvalidArgumentException (not ExpiredIteratorException) when the
// requested sequence number is past the stream's retention window.
throw new HoodieReadFromSourceException("Sequence number in checkpoint is expired or invalid for shard "
+ range.getShardId() + ". Reset the checkpoint to recover.", e);
} catch (ResourceNotFoundException e) {
throw new HoodieReadFromSourceException("Shard or stream not found: " + range.getShardId(), e);
} catch (ProvisionedThroughputExceededException e) {
throw new HoodieReadFromSourceException("Kinesis throughput exceeded reading shard " + range.getShardId(), e);
}
List<Record> allRecords = new ArrayList<>();
String lastSequenceNumber = null;
int requestCount = 0;

while (allRecords.size() < maxTotalRecords && shardIterator != null) {
GetRecordsResponse response;
try {
response = client.getRecords(
GetRecordsRequest.builder()
.shardIterator(shardIterator)
.limit(Math.min(maxRecordsPerRequest, (int) (maxTotalRecords - allRecords.size())))
.build());
} catch (ExpiredIteratorException e) {
log.warn("Shard iterator expired for {} during GetRecords, stopping read", range.getShardId());
break;
} catch (ProvisionedThroughputExceededException e) {
throw new HoodieReadFromSourceException("Kinesis throughput exceeded reading shard " + range.getShardId(), e);
}

List<Record> records = response.records();
// Update shardIterator before the empty check so its null-ness correctly reflects end-of-shard
// even when the final response carries 0 records (closed shard fully exhausted).
shardIterator = response.nextShardIterator();
// CASE 1: No records returned: stop polling. nextShardIterator can be non-null when at LATEST with no new
// data; continuing would cause an infinite loop of empty GetRecords calls.
if (records.isEmpty()) {
break;
}
// CASE 2: records returned.
List<Record> toAdd = enableDeaggregation ? KinesisDeaggregator.deaggregate(records) : records;
for (Record r : toAdd) {
allRecords.add(r);
}
// Checkpoint uses the last Kinesis record's sequence number (from raw records, not deaggregated)
lastSequenceNumber = records.get(records.size() - 1).sequenceNumber();

requestCount++;
// This is for rate limiting
if (shardIterator != null && intervalMs > 0) {
Thread.sleep(intervalMs);
}
}
Contributor

Could we wrap the logic here into a closable iterator that can be directly used by the executor in JsonKinesisSource#toBatch without accumulating records in memory and then returning the iterator?
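The shape of that refactor could be a pull-based iterator that fetches one GetRecords page at a time. In this sketch a Function stands in for the Kinesis client call, plain strings stand in for shard iterators, and the class/field names are illustrative:

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Function;

public class ShardRecordIteratorSketch<T> implements Iterator<T>, AutoCloseable {
  /** One GetRecords response: a page of records plus the next shard iterator. */
  public static final class Page<T> {
    public final List<T> records;
    public final String nextIterator; // null signals end of a closed shard
    public Page(List<T> records, String nextIterator) {
      this.records = records;
      this.nextIterator = nextIterator;
    }
  }

  private final Function<String, Page<T>> fetchPage;
  private String shardIterator;
  private Iterator<T> current = java.util.Collections.emptyIterator();
  private boolean closed = false;

  public ShardRecordIteratorSketch(String firstIterator, Function<String, Page<T>> fetchPage) {
    this.shardIterator = firstIterator;
    this.fetchPage = fetchPage;
  }

  @Override
  public boolean hasNext() {
    while (!closed && !current.hasNext() && shardIterator != null) {
      Page<T> page = fetchPage.apply(shardIterator);
      shardIterator = page.nextIterator;
      if (page.records.isEmpty()) {
        // Empty page at LATEST: stop rather than spin on empty GetRecords calls.
        shardIterator = null;
        return false;
      }
      current = page.records.iterator();
    }
    return !closed && current.hasNext();
  }

  @Override
  public T next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    return current.next();
  }

  @Override
  public void close() {
    closed = true; // a real implementation would also close the KinesisClient here
  }
}
```

This keeps at most one page of records in executor memory; rate limiting, deaggregation, and checkpoint tracking would hook into the fetch step.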

Comment on lines +469 to +472
// This is for rate limiting
if (shardIterator != null && intervalMs > 0) {
Thread.sleep(intervalMs);
}
Contributor

Could we remove this self rate-limiting? Instead, we should let the SDK handle the retries after rate limiting.

Comment on lines +494 to +495
builder.shardIteratorType(defaultPosition == KinesisSourceConfig.KinesisStartingPosition.TRIM_HORIZON
? ShardIteratorType.TRIM_HORIZON : ShardIteratorType.LATEST);
Contributor

KinesisSourceConfig.KinesisStartingPosition.EARLIEST should also map to ShardIteratorType.TRIM_HORIZON.
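Sketched with local stand-in enums (the real types are KinesisSourceConfig.KinesisStartingPosition and the SDK's ShardIteratorType), the mapping would be:

```java
public class StartingPositionSketch {
  /** Local stand-ins for the PR's enum and the AWS SDK's ShardIteratorType. */
  public enum KinesisStartingPosition { TRIM_HORIZON, EARLIEST, LATEST }
  public enum ShardIteratorType { TRIM_HORIZON, LATEST }

  /**
   * EARLIEST is documented as an alias of TRIM_HORIZON, so both must map to
   * ShardIteratorType.TRIM_HORIZON; only LATEST maps to LATEST.
   */
  public static ShardIteratorType toIteratorType(KinesisStartingPosition pos) {
    switch (pos) {
      case TRIM_HORIZON:
      case EARLIEST:
        return ShardIteratorType.TRIM_HORIZON;
      default:
        return ShardIteratorType.LATEST;
    }
  }
}
```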

return new ShardReadResult(allRecords, Option.ofNullable(lastSequenceNumber), shardIterator == null);
}

private static String getShardIterator(KinesisClient client, String streamName,
Contributor

Suggested change
private static String getShardIterator(KinesisClient client, String streamName,
private static String getCurrentCursor(KinesisClient client, String streamName,

KinesisShardRange range, KinesisSourceConfig.KinesisStartingPosition defaultPosition,
int maxRecordsPerRequest, long intervalMs, long maxTotalRecords,
boolean enableDeaggregation) throws InterruptedException {
String shardIterator;
Contributor

Suggested change
String shardIterator;
String currentCursor;

allRecords.add(r);
}
// Checkpoint uses the last Kinesis record's sequence number (from raw records, not deaggregated)
lastSequenceNumber = records.get(records.size() - 1).sequenceNumber();
Contributor

This assumes that the returned records are sorted based on the sequence number. Is that guaranteed?


// Filter out shards with no unread records to avoid unnecessary GetRecords calls
boolean useLatestWhenNoCheckpoint =
offsetGen.getStartingPosition() == KinesisSourceConfig.KinesisStartingPosition.LATEST;
Contributor

Rename this to startingStrategy or something similar to be readable?

boolean useLatestWhenNoCheckpoint =
offsetGen.getStartingPosition() == KinesisSourceConfig.KinesisStartingPosition.LATEST;
KinesisOffsetGen.KinesisShardRange[] allShardRanges = shardRanges;
int beforeFilter = shardRanges.length;
Contributor

Suggested change
int beforeFilter = shardRanges.length;
int lengthBeforeFilter = shardRanges.length;

Comment on lines +84 to +85
String checkpointStr = lastCheckpoint.isPresent() ? lastCheckpoint.get().getCheckpointKey() : "";
return new InputBatch<>(Option.empty(), checkpointStr);
Contributor

Could there be a case where the checkpoint needs to be set, instead of empty string, and there is no message to ingest?

Comment on lines +66 to +67
KinesisOffsetGen.KinesisShardRange[] shardRanges = offsetGen.getNextShardRanges(
lastCheckpoint, sourceLimit, metrics);
Contributor

For all the methods handling shard ranges, could you use List<KinesisOffsetGen.KinesisShardRange> instead of array so it's easier to read?

Comment on lines +103 to +111
/**
* Create checkpoint string from the batch and shard ranges.
* Subclasses provide checkpoint data (shardId -> sequenceNumber) collected during the read.
* Must include both read shards (from shardRangesRead) and filtered shards (from allShardRanges)
* so the next run does not re-read filtered-out shards from TRIM_HORIZON.
*/
protected abstract String createCheckpointFromBatch(T batch,
KinesisOffsetGen.KinesisShardRange[] shardRangesRead,
KinesisOffsetGen.KinesisShardRange[] allShardRanges);
Contributor

Could shardRangesRead and allShardRanges be renamed to be easily understood?

Comment on lines +66 to +67
KinesisOffsetGen.KinesisShardRange[] shardRanges = offsetGen.getNextShardRanges(
lastCheckpoint, sourceLimit, metrics);
Contributor

To clarify, offsetGen.getNextShardRanges returns all the open and closed ranges (excluding expired ranges) derived from the checkpoint and the current stream state, correct?

// Handle expired shards that exist in the last checkpoint.
if (!expiredShardIds.isEmpty()) {
boolean failOnDataLoss = getBooleanWithAltKeys(props, KinesisSourceConfig.ENABLE_FAIL_ON_DATA_LOSS);
for (String shardId : expiredShardIds) {
Contributor

There should also be validation on open and closed shards. If the last sequence number of a shard in last checkpoint is before the start of the shard based on the current state, e.g., due to data retention, there is also data loss.

Comment on lines +155 to +157
<!-- AWS Kinesis SDK for JsonKinesisSource -->
<include>software.amazon.awssdk:kinesis</include>

Contributor

Should we add this to hudi-aws-bundle so that it is not added here? I don't see any other AWS artifacts included.

Comment on lines +67 to +72
private static class ShardFetchResult implements Serializable {
private final List<String> records;
private final String shardId;
private final Option<String> lastSequenceNumber;
private final boolean reachedEndOfShard;
}
Contributor

This class is similar to KinesisOffsetGen.ShardReadResult. Could we keep one of them only?

Comment on lines +92 to +96
long totalMsgs = getRecordCount(batch);
metrics.updateStreamerSourceNewMessageCount(METRIC_NAME_KINESIS_MESSAGE_IN_COUNT, totalMsgs);

log.info("Read {} records from Kinesis stream {} with {} shards, checkpoint: {}",
totalMsgs, offsetGen.getStreamName(), shardRanges.length, checkpointStr);

This triggers eager evaluation and record reading. Could we avoid that?
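One way to get the count without an extra eager action is to count records as a side effect of normal consumption, e.g. by wrapping the record iterator; in Spark this would typically be a `LongAccumulator`, but the idea can be shown with a plain-Java sketch using an `AtomicLong`:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.concurrent.atomic.AtomicLong;

public class CountingIterator<T> implements Iterator<T> {
  private final Iterator<T> inner;
  private final AtomicLong counter;

  public CountingIterator(Iterator<T> inner, AtomicLong counter) {
    this.inner = inner;
    this.counter = counter;
  }

  @Override
  public boolean hasNext() {
    return inner.hasNext();
  }

  @Override
  public T next() {
    counter.incrementAndGet(); // count as a side effect of normal consumption
    return inner.next();
  }

  public static void main(String[] args) {
    AtomicLong seen = new AtomicLong();
    Iterator<String> it =
        new CountingIterator<>(Arrays.asList("a", "b", "c").iterator(), seen);
    while (it.hasNext()) {
      it.next();
    }
    System.out.println(seen.get()); // 3
  }
}
```

The metric would then be reported after the batch is consumed downstream, instead of forcing an extra pass just to compute the count.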

private final boolean shouldAddOffsets;
private final boolean enableDeaggregation;
private final int maxRecordsPerRequest;
private final long intervalMs;

nit: rename for readability


JavaRDD<ShardFetchResult> fetchRdd = sparkContext.parallelize(
java.util.Arrays.asList(shardRanges), shardRanges.length)
.mapPartitions(shardRangeIt -> {

Any reason for using mapPartitions instead of map?

@linliu-code (Collaborator, Author) replied on Feb 27, 2026:

Generally mapPartitions is more efficient than map since some resources, like the client, can be reused if a partition is assigned multiple shards. But here there is probably no big difference since we assign one shard per partition.
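The resource-reuse argument can be illustrated without Spark: with mapPartitions-style processing, an expensive client is created once per partition and shared by every element in it, instead of once per element. A plain-Java sketch, where `FakeClient` is a hypothetical stand-in for the Kinesis client:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class PartitionClientReuse {
  // Stand-in for an expensive-to-create client such as a KinesisClient.
  static class FakeClient {
    static int instancesCreated = 0;

    FakeClient() {
      instancesCreated++;
    }

    String fetch(String shardId) {
      return "records-of-" + shardId;
    }
  }

  // mapPartitions-style: one client per partition, reused across elements.
  static List<String> processPartition(Iterator<String> shardIds) {
    FakeClient client = new FakeClient();
    List<String> out = new ArrayList<>();
    while (shardIds.hasNext()) {
      out.add(client.fetch(shardIds.next()));
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> results =
        processPartition(Arrays.asList("shard-0", "shard-1", "shard-2").iterator());
    System.out.println(results.size());              // 3
    System.out.println(FakeClient.instancesCreated); // 1, not 3
  }
}
```

With a map-style transform, the equivalent would construct one client per shard element; at one shard per partition the two are equivalent, as the reply notes.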

String json = recordToJsonStatic(r, range.getShardId(), readConfig.isShouldAddOffsets());
if (json != null) {
recordStrings.add(json);
}

Should it throw a runtime exception if json is null?

readConfig.getMaxRecordsPerRequest(), readConfig.getIntervalMs(), readConfig.getMaxRecordsPerShard(),
readConfig.isEnableDeaggregation());

List<String> recordStrings = new ArrayList<>();

Similarly, could we avoid accumulating all JSON strings in a list, and construct an iterator instead?
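Instead of buffering every JSON string in a list, the partition function could return an iterator that converts records lazily and skips null conversions as it goes. A self-contained plain-Java sketch of such an iterator (the conversion function stands in for `recordToJsonStatic`):

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Function;

public class LazyJsonIterator implements Iterator<String> {
  private final Iterator<String> records;
  private final Function<String, String> toJson;
  private String next;

  public LazyJsonIterator(Iterator<String> records, Function<String, String> toJson) {
    this.records = records;
    this.toJson = toJson;
    advance();
  }

  private void advance() {
    next = null;
    while (next == null && records.hasNext()) {
      next = toJson.apply(records.next()); // null results are skipped, not stored
    }
  }

  @Override
  public boolean hasNext() {
    return next != null;
  }

  @Override
  public String next() {
    if (next == null) {
      throw new NoSuchElementException();
    }
    String result = next;
    advance();
    return result;
  }

  public static void main(String[] args) {
    Iterator<String> it = new LazyJsonIterator(
        Arrays.asList("a", "", "b").iterator(),
        r -> r.isEmpty() ? null : "{\"data\":\"" + r + "\"}");
    int count = 0;
    while (it.hasNext()) {
      it.next();
      count++;
    }
    System.out.println(count); // 2: the empty record is skipped lazily
  }
}
```

Returning such an iterator from `mapPartitions` keeps memory bounded by Spark's own consumption, rather than by the number of records fetched from a shard.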

Comment on lines +179 to +193
private static KinesisClient createKinesisClientFromConfig(KinesisReadConfig config) {
software.amazon.awssdk.services.kinesis.KinesisClientBuilder builder =
KinesisClient.builder().region(software.amazon.awssdk.regions.Region.of(config.getRegion()));
if (config.getEndpointUrl() != null && !config.getEndpointUrl().isEmpty()) {
builder = builder.endpointOverride(java.net.URI.create(config.getEndpointUrl()));
}
if (config.getAccessKey() != null && !config.getAccessKey().isEmpty()
&& config.getSecretKey() != null && !config.getSecretKey().isEmpty()) {
builder = builder.credentialsProvider(
software.amazon.awssdk.auth.credentials.StaticCredentialsProvider.create(
software.amazon.awssdk.auth.credentials.AwsBasicCredentials.create(
config.getAccessKey(), config.getSecretKey())));
}
return builder.build();
}

KinesisOffsetGen has createKinesisClient with similar logic. Let's consolidate these two into one.

if (dataStr == null || dataStr.trim().isEmpty()) {
return null;
}
if (shouldAddOffsets) {

nit: rename the config and variable to be aligned with Kinesis

return dataStr;
}

private Map<String, String> buildCheckpointFromSummaries(List<ShardFetchSummary> summaries) {

Does this contain all shards, including the ones that are filtered out (i.e., shards without new data or not read in this batch)?
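If only the shards read in this batch produce summaries, the resulting checkpoint should be merged with the previous one so that untouched shards keep their last committed sequence numbers. A hedged plain-Java sketch of that merge, with hypothetical method names:

```java
import java.util.HashMap;
import java.util.Map;

public class CheckpointMerge {
  // Entries from this batch override old ones; shards absent from this
  // batch keep their previously committed sequence numbers.
  static Map<String, String> merge(Map<String, String> previous,
                                   Map<String, String> thisBatch) {
    Map<String, String> merged = new HashMap<>(previous);
    merged.putAll(thisBatch);
    return merged;
  }

  public static void main(String[] args) {
    Map<String, String> previous = new HashMap<>();
    previous.put("shard-0", "seq-10");
    previous.put("shard-1", "seq-20");

    Map<String, String> thisBatch = new HashMap<>();
    thisBatch.put("shard-1", "seq-25"); // only shard-1 had new data

    Map<String, String> merged = merge(previous, thisBatch);
    System.out.println(merged.get("shard-0")); // seq-10 is retained
    System.out.println(merged.get("shard-1")); // advanced to seq-25
  }
}
```

Without such a merge, a shard that is idle for one batch would drop out of the checkpoint and be re-read from its configured starting position on the next run.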

}

@Override
protected String createCheckpointFromBatch(JavaRDD<String> batch,

The checkpoint calculation logic is spread across multiple methods (toBatch, buildCheckpointFromSummaries, createCheckpointFromBatch, etc.). Could it be consolidated into one place?
