
Restore: add and fill host info in restore progress #4088

Open · wants to merge 7 commits into master
Conversation

@Michal-Leszczynski Michal-Leszczynski commented Oct 29, 2024

This PR creates and fills the new Swagger host restore progress definitions from #4082 in the restore service.
It also updates managerclient to correctly display bandwidth and shard information in sctool progress.

Fixes #4042

@Michal-Leszczynski
Collaborator Author

Examples:

Before first load&stream
miles@fedora:~/scylla-manager$ ./sctool.dev progress -c myc restore/81d15a37-0774-453b-b6b9-10d9ece8a0e4 --details
Restore progress
Run:            a19376f0-95cf-11ef-9136-0892040e83bb
Status:         RUNNING (restoring backed-up data)
Start time:     29 Oct 24 09:27:27 CET
Duration:       10s
Progress:       0% | 37%
Snapshot Tag:   sm_20241021091028UTC
Bandwidth:
  - Download:    316.128k/s
  - Load&stream: unknown

╭─────────────────────────────────────────────────┬──────────┬──────┬─────────┬────────────┬────────╮
│ Keyspace                                        │ Progress │ Size │ Success │ Downloaded │ Failed │
├─────────────────────────────────────────────────┼──────────┼──────┼─────────┼────────────┼────────┤
│ multi_location_4d99b6b98f8c11efb0cb0892040e83bb │ 0% | 37% │  86k │       0 │    32.245k │      0 │
╰─────────────────────────────────────────────────┴──────────┴──────┴─────────┴────────────┴────────╯

Hosts info
╭────────────────┬────────┬────────────────────┬───────────────────────╮
│ Host           │ Shards │ Download bandwidth │ Load&stream bandwidth │
├────────────────┼────────┼────────────────────┼───────────────────────┤
│ 192.168.200.13 │      2 │   160.801k/s/shard │               unknown │
├────────────────┼────────┼────────────────────┼───────────────────────┤
│ 192.168.200.12 │      2 │   155.433k/s/shard │               unknown │
╰────────────────┴────────┴────────────────────┴───────────────────────╯

During restore
miles@fedora:~/scylla-manager$ ./sctool.dev progress -c myc restore/81d15a37-0774-453b-b6b9-10d9ece8a0e4 --details
Restore progress
Run:            a19376f0-95cf-11ef-9136-0892040e83bb
Status:         RUNNING (restoring backed-up data)
Start time:     29 Oct 24 09:27:27 CET
Duration:       31s
Progress:       74% | 100%
Snapshot Tag:   sm_20241021091028UTC
Bandwidth:
  - Download:    306.050k/s
  - Load&stream: 1.585k/s

╭─────────────────────────────────────────────────┬────────────┬──────┬─────────┬────────────┬────────╮
│ Keyspace                                        │   Progress │ Size │ Success │ Downloaded │ Failed │
├─────────────────────────────────────────────────┼────────────┼──────┼─────────┼────────────┼────────┤
│ multi_location_4d99b6b98f8c11efb0cb0892040e83bb │ 74% | 100% │  86k │ 64.368k │        86k │      0 │
╰─────────────────────────────────────────────────┴────────────┴──────┴─────────┴────────────┴────────╯

Hosts info
╭────────────────┬────────┬────────────────────┬───────────────────────╮
│ Host           │ Shards │ Download bandwidth │ Load&stream bandwidth │
├────────────────┼────────┼────────────────────┼───────────────────────┤
│ 192.168.200.12 │      2 │   150.350k/s/shard │           813/s/shard │
├────────────────┼────────┼────────────────────┼───────────────────────┤
│ 192.168.200.13 │      2 │   155.797k/s/shard │           810/s/shard │
╰────────────────┴────────┴────────────────────┴───────────────────────╯

During repair
miles@fedora:~/scylla-manager$ ./sctool.dev progress -c myc restore/81d15a37-0774-453b-b6b9-10d9ece8a0e4 --details
Restore progress
Run:            a19376f0-95cf-11ef-9136-0892040e83bb
Status:         DONE
Start time:     29 Oct 24 09:27:27 CET
End time:       29 Oct 24 09:28:02 CET
Duration:       34s
Progress:       100% | 100%
Snapshot Tag:   sm_20241021091028UTC
Bandwidth:
  - Download:    306.050k/s
  - Load&stream: 1.415k/s


╭─────────────────────────────────────────────────┬─────────────┬──────┬─────────┬────────────┬────────╮
│ Keyspace                                        │    Progress │ Size │ Success │ Downloaded │ Failed │
├─────────────────────────────────────────────────┼─────────────┼──────┼─────────┼────────────┼────────┤
│ multi_location_4d99b6b98f8c11efb0cb0892040e83bb │ 100% | 100% │  86k │     86k │        86k │      0 │
╰─────────────────────────────────────────────────┴─────────────┴──────┴─────────┴────────────┴────────╯

Hosts info
╭────────────────┬────────┬────────────────────┬───────────────────────╮
│ Host           │ Shards │ Download bandwidth │ Load&stream bandwidth │
├────────────────┼────────┼────────────────────┼───────────────────────┤
│ 192.168.200.12 │      2 │   150.350k/s/shard │           724/s/shard │
├────────────────┼────────┼────────────────────┼───────────────────────┤
│ 192.168.200.13 │      2 │   155.797k/s/shard │           724/s/shard │
╰────────────────┴────────┴────────────────────┴───────────────────────╯

@Michal-Leszczynski Michal-Leszczynski marked this pull request as ready for review October 29, 2024 09:34
@karol-kokoszka
Collaborator

Why is it displayed as kilobytes (160.801k/s/shard) instead of (160.801kB/s/shard)?

Is (813/s/shard) bytes per second per shard? The B is missing.

		}
		return endV.Sub(*start)
	}
	return 0

Suggested change:
-	return 0
+	return time.Duration(0)

for clarity

@karol-kokoszka left a comment
@Michal-Leszczynski idle time per node is missing

@Michal-Leszczynski
Collaborator Author

Why is it displayed as kilobytes (160.801k/s/shard) instead of (160.801kB/s/shard)?
Is (813/s/shard) bytes per second per shard? The B is missing.

I just used the standard way in which we display bytes in managerclient.
I will update it to use KiB instead of k, etc. (in other places as well).

@Michal-Leszczynski idle time per node is missing

That's true! I forgot about your suggestion in the issue.

Unfortunately, it can't be calculated as Restore duration - (time reported as download) - (time reported as load&stream), since we don't really have the Restore duration part. Note that the duration displayed in sctool progress is the duration of the current run, not of the entire task execution. So if a restore ran for 1h and was then paused and resumed, sctool progress would display a duration close to 0, while we would like the idle time to be consistent across runs.

Of course, we could manually calculate the entire task execution duration by traversing previous runs (on the SM side), but even that won't solve the whole problem.

Another issue with this approach is that the total task execution duration also includes other time-consuming restore stages (e.g. indexing, changing tombstone_gc, rebuilding views, ...). This would result in reporting an overestimated idle time, which could cause more harm than good.

To overcome that, we would need to know when the download and load&stream stages started and finished, but we currently don't have that information.

For those reasons, I would prefer to skip the idle time display, as it can still be observed via SM metrics.
@karol-kokoszka what do you think about it?

It was a leftover from feature development :/
It's going to be needed for calculating per-shard
download/load&stream bandwidth in the progress command.
This commit also moves host shard info to the tablesWorker,
as it is commonly reused during the restore procedure.
This allows calculating per-shard download/load&stream
bandwidth in the 'sctool progress' display.

s: add and fill host info in prog
It is nicer to see:
"Size: 10B" instead of "Size: 10" or
"Size: 20KiB" instead of "Size: 20k".
@karol-kokoszka
Collaborator

Having information that the node was spending time on something other than l&s or download helps with finding potential optimizations for the restore.
Seeing high bandwidth for load&stream and high bandwidth for download may be misleading when there is no information about how the node was utilized during the restore.

Examples from 3.3.3 show that a node can download fast and l&s fast, yet remain idle for most of the time.

Of course, we could manually calculate the entire task execution duration by traversing previous runs (on the SM side),

Yes, then let's do it.

Another issue with this approach is that the total task execution duration also includes other time-consuming restore stages (e.g. indexing, changing tombstone_gc, rebuilding views, ...). This would result in reporting an overestimated idle time, which could cause more harm than good.

Call it other instead of idle then.

Successfully merging this pull request may close these issues.

Store bandwidth characteristic of Manager restore process