
[Fleet] Repro backing index is overlapping with backing index #193503

Closed

Conversation

@nchaulet (Member) commented Sep 20, 2024

Summary

Reproduction for the following error when upgrading an integration after a rollback:

illegal_argument_exception: backing index [.ds-metrics-no_tsdb_to_tsdb.test-default-2024.09.20-000002] with range [2024-09-19T22:49:57.000Z TO 2024-09-20T01:19:57.000Z] is overlapping with backing index [.ds-metrics-no_tsdb_to_tsdb.test-default-2024.09.20-000005] with range [2024-09-19T22:50:04.000Z TO 2024-09-20T01:20:04.000Z]

When the upgrade to TSDB for a data stream does not succeed and we roll back to a non-TSDB version, subsequent upgrades fail.

Details on the failing scenario

  1. We install the package; this creates the data stream metrics-no_tsdb_to_tsdb.test-default without TSDB => this creates backing index 0001
  2. We upgrade, updating the metrics-no_tsdb_to_tsdb.test-default data stream to TSDB => this creates a time series backing index 0002
  3. We roll back the package, updating metrics-no_tsdb_to_tsdb.test-default and rolling over without TSDB => this creates backing indices 0003 and 0004 without time series
  4. We try to upgrade the metrics-no_tsdb_to_tsdb.test-default data stream to TSDB again (see the sketch after this list) => this fails with
backing index [.ds-metrics-no_tsdb_to_tsdb.test-default-2024.09.20-000002] with range [2024-09-19T22:49:57.000Z TO 2024-09-20T01:19:57.000Z] is overlapping with backing index [.ds-metrics-no_tsdb_to_tsdb.test-default-2024.09.20-000005] with range [2024-09-19T22:50:04.000Z TO 2024-09-20T01:20:04.000Z]
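
For illustration, a minimal sketch of the four steps as API calls, assuming a Kibana FTR-style test where supertest targets the Kibana API and es is the Elasticsearch client service. The package name no_tsdb_to_tsdb and the versions 0.1.0/0.2.0 are illustrative, and the rollover is triggered explicitly because it is lazy on upgrade (see the review comment further down):

// Sketch only: package name and versions are illustrative, error handling is omitted.
const dataStream = 'metrics-no_tsdb_to_tsdb.test-default';

// 1. Install the non-TSDB package version; ingesting one document creates the
//    data stream and backing index -000001.
await supertest
  .post('/api/fleet/epm/packages/no_tsdb_to_tsdb/0.1.0')
  .set('kbn-xsrf', 'xxxx')
  .send({ force: true })
  .expect(200);
await es.index({ index: dataStream, op_type: 'create', document: { '@timestamp': new Date().toISOString() } });

// 2. Upgrade to the TSDB package version and roll over => time series backing index -000002.
await supertest
  .post('/api/fleet/epm/packages/no_tsdb_to_tsdb/0.2.0')
  .set('kbn-xsrf', 'xxxx')
  .send({ force: true })
  .expect(200);
await es.indices.rollover({ alias: dataStream });

// 3. Roll back to the non-TSDB version and roll over => standard backing indices without time series settings.
await supertest
  .post('/api/fleet/epm/packages/no_tsdb_to_tsdb/0.1.0')
  .set('kbn-xsrf', 'xxxx')
  .send({ force: true })
  .expect(200);
await es.indices.rollover({ alias: dataStream });

// 4. Upgrade to the TSDB version again and roll over => the rollover is rejected with the
//    illegal_argument_exception about overlapping backing indices quoted above.
await supertest
  .post('/api/fleet/epm/packages/no_tsdb_to_tsdb/0.2.0')
  .set('kbn-xsrf', 'xxxx')
  .send({ force: true })
  .expect(200);
await es.indices.rollover({ alias: dataStream });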

@martijnvg your name seems associated with a lot of TSDB work in Elasticsearch; maybe you can help me understand (or redirect me to someone who can) the behaviour here, and whether there is a bug in how the upgrade is handled in Elasticsearch, or whether there is something we can do in Fleet to avoid it. Thanks a lot!


@nchaulet force-pushed the repro-overlapping-backing-index branch from b544687 to 9cdc840 on September 20, 2024 01:30
@nchaulet force-pushed the repro-overlapping-backing-index branch from 9cdc840 to b2574a0 on September 20, 2024 12:46
@kibana-ci (Collaborator) commented Sep 20, 2024

💔 Build Failed

Failed CI Steps

Test Failures

  • [job] [logs] FTR Configs #45 / EPM Endpoints EPM - get Installed Packages Allows the fetching of installed packages
  • [job] [logs] FTR Configs #45 / EPM Endpoints EPM - get Installed Packages Allows the fetching of installed packages

Metrics [docs]

✅ unchanged

History

  • 💔 Build #235902 failed 9cdc8404a881578f718263938fa5038a68daf549
  • 💔 Build #235803 failed b5446873af7b152080ebe2d8d8f4abb150751763

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

.send({ force: true })
.expect(200);

// Simulate rollover on upgrade, it should throw
@nchaulet (Member, Author) commented on the test snippet above:
Rollover is lazy on upgrade, that's why I triggered the rollover in that test.
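
For context, a rough sketch of what that explicit trigger can look like in the test, assuming es is the FTR Elasticsearch client service and that the expected failure is the overlap error from this issue:

// The TSDB settings from the package upgrade only take effect on the next rollover,
// so the test forces one and expects it to be rejected.
let rolloverError: unknown;
try {
  await es.indices.rollover({ alias: 'metrics-no_tsdb_to_tsdb.test-default' });
} catch (error) {
  // illegal_argument_exception: backing index [...] is overlapping with backing index [...]
  rolloverError = error;
}
if (!rolloverError) {
  throw new Error('expected the rollover to fail with an overlapping backing index error');
}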

@salvatore-campagna commented:

@nchaulet what version of Elasticsearch are you using?

@nchaulet (Member, Author) commented:

@nchaulet what version of Elasticsearch are you using?

@salvatore-campagna the latest SNAPSHOT, but we also have an SDH reporting the same issue with 8.12.2.

@salvatore-campagna commented Sep 23, 2024

So time_series indices have specific start_time and end_time settings, which prevent backing indices of the same data stream from having overlapping time ranges. This is happening because switching back and forth results in multiple time_series indices being created in the data stream. Probably the new index (after the latest upgrade) is created with a start_time that is smaller than end_time of an existing time_series index. What I can suggest is to wait for the most recent end_time among all the existing time_series indices in the data stream to expire before upgrading.
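
To see the ranges involved, one can dump the time_series settings of each backing index; a minimal sketch with the Elasticsearch JS client, where the node URL is a placeholder and the data stream name comes from the repro above:

import { Client } from '@elastic/elasticsearch';

// Sketch: print index.time_series.start_time / end_time for every backing index of the
// data stream; standard (non-TSDB) backing indices simply have no time_series settings.
const es = new Client({ node: 'http://localhost:9200' });
const { data_streams: dataStreams } = await es.indices.getDataStream({
  name: 'metrics-no_tsdb_to_tsdb.test-default',
});
for (const { index_name: indexName } of dataStreams[0].indices) {
  const settings = await es.indices.getSettings({ index: indexName });
  const timeSeries = settings[indexName]?.settings?.index?.time_series;
  console.log(indexName, timeSeries?.start_time, timeSeries?.end_time);
}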

@nchaulet (Member, Author) commented:

This is happening because switching back and forth results in multiple time_series indices being created in the data stream. Probably the new index (after the latest upgrade) is created with a start_time that is smaller than end_time of an existing time_series index.

@salvatore-campagna But shouldn't this be handled by the rollover that creates the new index? When we roll over an existing TSDB data stream it handles that, no?

@salvatore-campagna commented Sep 24, 2024

This is happening because switching back and forth results in multiple time_series indices being created in the data stream. Probably the new index (after the latest upgrade) is created with a start_time that is smaller than end_time of an existing time_series index.

@salvatore-campagna But shouldn't this be handled by the rollover that creates the new index? When we roll over an existing TSDB data stream it handles that, no?

There are two types of rollover operations occurring here (happening as a result of installing a new package/integration version):

  1. Rollover from a time_series index to a standard index: this happens when downgrading from TSDB (time-series database) to a standard index.
  2. Rollover from a standard index to a time_series index: this occurs when upgrading back to TSDB.

As a result, the data stream will contain at least three indices (though in your case, there are more). However, we are primarily concerned with the most recent three indices created:

  1. The first time_series index with start_time and end_time (let's call this index1).
  2. The standard index (index2).
  3. The most recent time_series index, again with start_time and end_time (index3).

When attempting the latest upgrade, you encounter a situation where index3.start_time < index1.end_time. This is not automatically handled during the rollover process. This is the time when the error happens. Backing time_series indices belonging to the same data stream are checked for overlapping time ranges, which are not allowed.

For this reason, I recommend waiting for index1.end_time to expire before initiating the next upgrade. By upgrading later, you ensure that when index3 is created, it will have index3.start_time > index1.end_time, avoiding the time overlap issue.
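
A rough sketch of that workaround, assuming the Elasticsearch JS client (node URL is a placeholder) and a caller-provided upgrade callback; the Fleet-side upgrade and rollover are out of scope here and only represented by that callback:

import { Client } from '@elastic/elasticsearch';

const es = new Client({ node: 'http://localhost:9200' });

// Wait until the latest index.time_series.end_time among the existing backing indices has
// passed, so the next time_series backing index cannot overlap, then run the upgrade.
async function upgradeWhenSafe(dataStream: string, upgrade: () => Promise<void>) {
  const { data_streams: dataStreams } = await es.indices.getDataStream({ name: dataStream });
  let latestEndTime = 0;
  for (const { index_name: indexName } of dataStreams[0].indices) {
    const settings = await es.indices.getSettings({ index: indexName });
    const endTime = settings[indexName]?.settings?.index?.time_series?.end_time;
    if (endTime) {
      latestEndTime = Math.max(latestEndTime, Date.parse(String(endTime)));
    }
  }
  const waitMs = latestEndTime - Date.now();
  if (waitMs > 0) {
    // index1.end_time has not expired yet: a new time_series index would overlap with it.
    await new Promise((resolve) => setTimeout(resolve, waitMs));
  }
  await upgrade();
}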

@nchaulet (Member, Author) commented:

When attempting the latest upgrade, you encounter a situation where index3.start_time < index1.end_time. This is not automatically handled during the rollover process. This is the time when the error happens. Backing time_series indices belonging to the same data stream are checked for overlapping time ranges, which are not allowed.

Could it be automatically handled during the rollover process? It seems to me that would be a better experience for the user. I can create an issue in ES for that improvement.

For this reason, I recommend waiting for index1.end_time to expire before initiating the next upgrade. By upgrading later, you ensure that when index3 is created, it will have index3.start_time > index1.end_time, avoiding the time overlap issue.

One of the issues here is that those upgrades/rollbacks come from automated processes in Fleet, and having to wait 4 hours is not really an ideal scenario.

@salvatore-campagna commented:

When attempting the latest upgrade, you encounter a situation where index3.start_time < index1.end_time. This is not automatically handled during the rollover process. This is the time when the error happens. Backing time_series indices belonging to the same data stream are checked for overlapping time ranges, which are not allowed.

Could it be automatically handled during the rollover process? It seems to me that would be a better experience for the user. I can create an issue in ES for that improvement.

For this reason, I recommend waiting for index1.end_time to expire before initiating the next upgrade. By upgrading later, you ensure that when index3 is created, it will have index3.start_time > index1.end_time, avoiding the time overlap issue.

One of the issues here is that those upgrades/rollbacks come from automated processes in Fleet, and having to wait 4 hours is not really an ideal scenario.

I see, and I understand this might be an issue. If you can create an issue for us, we will see what we can do.
